1. Problem Statement¶
Our project’s goal is to better understand energy consumption patterns and identify factors potentially affecting energy efficiency in Cambridge buildings. Insights gained from this project can be used to propose data-driven recommendations for reducing energy use and working towards ESG goals in Cambridge.
Using our model, we can predict energy consumption based on property data and provide actionable recommendations to reduce energy usage by identifying key contributing factors.
We use two data sources for the analysis.
- Cambridge Building Energy Use data obtained from the Cambridge Open Data Portal. It contains energy and water use data gathered under the BEUDO program for years 2015-2022. The data is at the parcel level. Properties subject to BEUDO reporting include large residential (50+ units), nonresidential (25,000+ sq. ft.), and municipal (10,000+ sq. ft.) buildings.
- Cambridge Property Database data also obtained from the Cambridge Open Data Portal, here. It contains physical details of the properties, including wall types, interior living space, number of kitchens, and much more. The data covers fiscal years 2016 through 2024. The data is at the building or unit level, necessitating aggregation for our purposes.
We chose to use weather-normalized site energy use intensity (Weather normalized Site EUI (kBtu/ft2)) as our response variable. This variable normalizes for building size and yearly weather variance, allowing us to better focus on the impact of materials, age, and other physical property details.
2. Setup¶
# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import missingno as msno
from collections import Counter
from scipy.stats import chi2_contingency
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor
import shap
from sklearn.inspection import permutation_importance
3. Load and Preview Data¶
# Load data
beudo_url = 'Cambridge_Building_Energy_Use_Disclosure_Ordinance__BEUDO__Data_2015-2022_20241025.csv'
property_db_url = 'Cambridge_Property_Database_FY2016-FY2024_20241112.csv'
energy_data = pd.read_csv(beudo_url)
property_data = pd.read_csv(property_db_url)
# Display basic information about the datasets
print(energy_data.info())
print(property_data.info())
print("\nEnergy data basic stats:")
print(energy_data.describe())
print("\nProperty data basic stats:\n")
print(property_data.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6823 entries, 0 to 6822
Data columns (total 46 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Reporting ID 6823 non-null object
1 Data Year 6823 non-null int64
2 MapLot 6823 non-null object
3 Annual Report Received 6823 non-null object
4 BEUDO Category 6823 non-null object
5 PD Parcel Living Area 6820 non-null float64
6 PD Parcel Units 5424 non-null float64
7 Address 6823 non-null object
8 Year Built 6535 non-null float64
9 Buildings Included 6823 non-null object
10 Buildings Included Count 6823 non-null int64
11 Primary Property Type - Self Selected 6532 non-null object
12 All Property Uses 6475 non-null object
13 Property GFA - Self Reported (ft2) 6537 non-null float64
14 Reported Residential Units 1320 non-null float64
15 Owner 6823 non-null object
16 Owner Line 2 2255 non-null object
17 ENERGY STAR Score 2658 non-null float64
18 Electricity Use - Grid Purchase (kWh) 5772 non-null float64
19 Electricity Use - Grid Purchase (kBtu) 5756 non-null float64
20 Natural Gas Use (therms) 4753 non-null float64
21 Natural Gas Use (kBtu) 4753 non-null float64
22 Fuel Oil #1 Use (kBtu) 0 non-null float64
23 Fuel Oil #2 Use (kBtu) 75 non-null float64
24 Fuel Oil #4 Use (kBtu) 0 non-null float64
25 Fuel Oil #5 & 6 Use (kBtu) 0 non-null float64
26 Diesel #2 Use (kBtu) 53 non-null float64
27 Kerosene Use (kBtu) 8 non-null float64
28 District Chilled Water Use (kBtu) 1111 non-null float64
29 District Steam Use (kBtu) 1686 non-null float64
30 Electricity Use - Generated From Onsite Renewable Systems (kWh) 172 non-null float64
31 Site Energy Use (kBtu) 5776 non-null float64
32 Weather Normalized Site Energy Use (kBtu) 5097 non-null float64
33 Site EUI (kBtu/ft2) 5746 non-null float64
34 Weather Normalized Site EUI (kBtu/ft2) 5089 non-null float64
35 Source Energy Use (kBtu) 5776 non-null float64
36 Weather Normalized Source Energy Use (kBtu) 5097 non-null float64
37 Source EUI (kBtu/ft2) 5747 non-null float64
38 Weather Normalized Source EUI (kBtu/ft2) 5089 non-null float64
39 Total GHG Emissions (Metric Tons CO2e) 6246 non-null float64
40 Total GHG Emissions Intensity (kgCO2e/ft2) 6113 non-null float64
41 Water Use (All Water Sources) (kgal) 5299 non-null float64
42 Water Intensity (All Water Sources) (gal/ft2) 5074 non-null float64
43 Latitude 6823 non-null float64
44 Longitude 6823 non-null float64
45 Location 6823 non-null object
dtypes: float64(33), int64(2), object(11)
memory usage: 2.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236801 entries, 0 to 236800
Data columns (total 62 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PID 236801 non-null int64
1 GISID 236689 non-null object
2 BldgNum 236801 non-null int64
3 Address 236801 non-null object
4 Unit 135901 non-null object
5 Latitude 233141 non-null float64
6 Longitude 233141 non-null float64
7 StateClassCode 236801 non-null int64
8 PropertyClass 236801 non-null object
9 Zoning 106009 non-null object
10 Map/Lot 236801 non-null object
11 LandArea 236801 non-null int64
12 YearOfAssessment 236801 non-null int64
13 TaxDistrict 236792 non-null object
14 ResidentialExemption 236801 non-null bool
15 BuildingValue 236801 non-null float64
16 LandValue 236801 non-null float64
17 AssessedValue 236801 non-null float64
18 SalePrice 236801 non-null float64
19 Book/Page 236793 non-null object
20 SaleDate 236265 non-null object
21 PreviousAssessedValue 236801 non-null float64
22 Owner_Name 236788 non-null object
23 Owner_CoOwnerName 89353 non-null object
24 Owner_Address 236702 non-null object
25 Owner_Address2 19200 non-null object
26 Owner_City 236730 non-null object
27 Owner_State 236664 non-null object
28 Owner_Zip 236673 non-null object
29 Exterior_Style 158569 non-null object
30 Exterior_occupancy 95817 non-null object
31 Exterior_NumStories 236801 non-null float64
32 Exterior_WallType 95628 non-null object
33 Exterior_WallHeight 26827 non-null float64
34 Exterior_RoofType 69014 non-null object
35 Exterior_RoofMaterial 95531 non-null object
36 Exterior_FloorLocation 131145 non-null float64
37 Exterior_View 135829 non-null object
38 Interior_LivingArea 236801 non-null int64
39 Interior_NumUnits 100966 non-null float64
40 Interior_TotalRooms 209974 non-null float64
41 Interior_Bedrooms 209974 non-null float64
42 Interior_Kitchens 201584 non-null object
43 Interior_FullBaths 209974 non-null float64
44 Interior_HalfBaths 209974 non-null float64
45 Interior_Fireplaces 209974 non-null float64
46 Interior_Flooring 125456 non-null object
47 Interior_Layout 130887 non-null object
48 Interior_LaundryInUnit 135835 non-null object
49 Systems_HeatType 224875 non-null object
50 Systems_HeatFuel 91692 non-null object
51 Systems_CentralAir 74139 non-null object
52 Systems_Plumbing 26775 non-null object
53 Condition_YearBuilt 236801 non-null int64
54 Condition_InteriorCondition 69016 non-null object
55 Condition_OverallCondition 231247 non-null object
56 Condition_OverallGrade 231656 non-null object
57 Parking_Open 209974 non-null float64
58 Parking_Covered 209974 non-null float64
59 Parking_Garage 135835 non-null float64
60 UnfinishedBasementGross 236801 non-null int64
61 FinishedBasementGross 209974 non-null float64
dtypes: bool(1), float64(20), int64(8), object(33)
memory usage: 110.4+ MB
None
Energy data basic stats:
Data Year PD Parcel Living Area PD Parcel Units Year Built \
count 6823.000000 6.820000e+03 5424.000000 6535.000000
mean 2018.508427 4.829514e+05 94.719764 1960.496404
std 2.291536 8.795619e+05 218.550142 46.165984
min 2015.000000 8.625000e+03 0.000000 1718.000000
25% 2017.000000 5.659400e+04 1.000000 1930.000000
50% 2019.000000 1.191090e+05 6.000000 1970.000000
75% 2021.000000 3.431340e+05 83.000000 1998.000000
max 2022.000000 3.691096e+06 1027.000000 2022.000000
Buildings Included Count Property GFA - Self Reported (ft2) \
count 6823.000000 6537.000000
mean 1.388685 110153.319107
std 1.656398 113137.003979
min 1.000000 761.000000
25% 1.000000 34000.000000
50% 1.000000 67362.000000
75% 1.000000 145275.000000
max 23.000000 759418.000000
Reported Residential Units ENERGY STAR Score \
count 1320.000000 2658.000000
mean 118.444227 64.873589
std 104.398047 29.777797
min 3.000000 1.000000
25% 55.000000 46.000000
50% 84.000000 73.000000
75% 166.000000 90.000000
max 565.000000 100.000000
Electricity Use - Grid Purchase (kWh) \
count 5.772000e+03
mean 1.838152e+06
std 3.007818e+06
min -2.038940e+04
25% 2.024091e+05
50% 6.576856e+05
75% 1.980930e+06
max 2.633504e+07
Electricity Use - Grid Purchase (kBtu) ... Source Energy Use (kBtu) \
count 5.756000e+03 ... 5.776000e+03
mean 6.279994e+06 ... 2.736149e+07
std 1.027202e+07 ... 4.887582e+07
min -6.956860e+04 ... 0.000000e+00
25% 6.913733e+05 ... 3.456744e+06
50% 2.245147e+06 ... 1.026384e+07
75% 6.767338e+06 ... 2.954168e+07
max 8.985517e+07 ... 6.917437e+08
Weather Normalized Source Energy Use (kBtu) Source EUI (kBtu/ft2) \
count 5.097000e+03 5747.000000
mean 2.628636e+07 227.611895
std 4.883734e+07 311.880230
min 0.000000e+00 0.000000
25% 3.305353e+06 86.200000
50% 9.306082e+06 131.800000
75% 2.735109e+07 260.700000
max 6.847265e+08 11310.600000
Weather Normalized Source EUI (kBtu/ft2) \
count 5089.000000
mean 220.914448
std 314.680210
min 0.000000
25% 86.500000
50% 128.400000
75% 239.300000
max 11312.900000
Total GHG Emissions (Metric Tons CO2e) \
count 6246.000000
mean 873.560439
std 1840.262547
min 0.000000
25% 67.700000
50% 273.300000
75% 945.500000
max 31623.800000
Total GHG Emissions Intensity (kgCO2e/ft2) \
count 6113.000000
mean 7.389377
std 10.296182
min 0.000000
25% 2.700000
50% 4.400000
75% 8.100000
max 288.900000
Water Use (All Water Sources) (kgal) \
count 5299.000000
mean 3405.057615
std 11306.779564
min 0.000000
25% 280.900000
50% 1162.200000
75% 3403.600000
max 359303.600000
Water Intensity (All Water Sources) (gal/ft2) Latitude Longitude
count 5074.000000 6823.000000 6823.000000
mean 42.406732 42.371642 -71.109055
std 632.077186 0.010310 0.019323
min 0.000000 42.353952 -71.157365
25% 6.550000 42.363389 -71.120259
50% 16.005000 42.370770 -71.111241
75% 30.965000 42.377251 -71.093183
max 41228.180000 42.403458 -71.068721
[8 rows x 35 columns]
Property data basic stats:
PID BldgNum Latitude Longitude \
count 236801.000000 236801.000000 233141.000000 233141.000000
mean 51707.812948 1.148289 42.375558 -71.112977
std 74481.896038 1.509807 0.010417 0.019560
min 3.000000 1.000000 42.353300 -71.160000
25% 7482.000000 1.000000 42.368600 -71.127500
50% 14724.000000 1.000000 42.372800 -71.112000
75% 22171.000000 1.000000 42.382800 -71.101600
max 202676.000000 38.000000 42.403000 -71.067200
StateClassCode LandArea YearOfAssessment BuildingValue \
count 236801.000000 2.368010e+05 236801.000000 2.368010e+05
mean 319.410801 1.567271e+04 2020.030122 5.909026e+06
std 927.667903 1.347158e+05 2.583882 5.450378e+07
min 13.000000 0.000000e+00 2016.000000 -1.419000e+06
25% 102.000000 0.000000e+00 2018.000000 4.604000e+05
50% 102.000000 0.000000e+00 2020.000000 6.562000e+05
75% 105.000000 3.699000e+03 2022.000000 9.619000e+05
max 9971.000000 7.251433e+06 2024.000000 1.124730e+09
LandValue AssessedValue ... Interior_Bedrooms \
count 2.368010e+05 2.368010e+05 ... 209974.00000
mean 4.214107e+06 1.012313e+07 ... 2.73472
std 4.522068e+07 9.392923e+07 ... 4.80240
min 0.000000e+00 0.000000e+00 ... 0.00000
25% 0.000000e+00 5.838000e+05 ... 2.00000
50% 0.000000e+00 8.654000e+05 ... 2.00000
75% 7.026000e+05 1.488000e+06 ... 3.00000
max 1.086314e+09 1.615000e+09 ... 462.00000
Interior_FullBaths Interior_HalfBaths Interior_Fireplaces \
count 209974.000000 209974.000000 209974.000000
mean 1.763609 0.273948 0.300480
std 3.490979 0.620907 0.825363
min 0.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000
50% 2.000000 0.000000 0.000000
75% 2.000000 0.000000 0.000000
max 321.000000 20.000000 24.000000
Condition_YearBuilt Parking_Open Parking_Covered Parking_Garage \
count 236801.000000 209974.000000 209974.000000 135835.000000
mean 1874.419441 0.683723 0.141675 0.348415
std 314.040535 1.173049 0.570949 1.198971
min 0.000000 0.000000 0.000000 0.000000
25% 1894.000000 0.000000 0.000000 0.000000
50% 1915.000000 0.000000 0.000000 0.000000
75% 1965.000000 1.000000 0.000000 1.000000
max 10902.000000 80.000000 45.000000 99.000000
UnfinishedBasementGross FinishedBasementGross
count 236801.000000 209974.000000
mean 470.456510 84.351977
std 1543.078049 259.994912
min -1125.000000 -950.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 672.000000 0.000000
max 77288.000000 6838.000000
[8 rows x 28 columns]
On first look, we can see that the energy use dataset has 48 columns and about 6800 observations. There are many missing values; three columns have no non-null values at all. Fortunately, the initial ID columns are fully intact.
The property dataset has 62 columns and almost 237,000 observations. There are again many missing values. Some of the columns with many missing values are irrelevant to our model, such as the co-owner's address, but there are others like Systems_Plumbing that should be widely relevant but have very few non-null values.
We already know that the two datasets are at different scales, hence the huge difference in observation counts. Aggregation or disaggregation will be vital here.
We won't comment on specific column summary statistics here, but we can see that many of the columns are heavily right-skewed.
# Preview energy data
print('Preview energy data')
energy_data.head()
Preview energy data
| Reporting ID | Data Year | MapLot | Annual Report Received | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Address | Year Built | Buildings Included | ... | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1074 | 2015 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | NaN | 801.1 | NaN | 1432.9 | 26.4 | 4754.0 | 87.64 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 1 | B1420 | 2015 | 7-115 | No | Non-Residential | 57035.0 | 123.0 | 215 Monsignor O'Brien Highway | 2015.0 | 358-1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 42.372868 | -71.079624 | POINT (-71.0796235216534 42.3728684150094) |
| 2 | B1465 | 2015 | 265B-34 | No | Non-Residential | 28054.0 | NaN | 205 Alewife Brook Parkway | 1975.0 | 76-21 | ... | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | 42.390311 | -71.143107 | POINT (-71.1431074295734 42.3903111902982) |
| 3 | B1550 | 2015 | 162-54 | No | Non-Residential | 36600.0 | 7.0 | 80-90 Winthrop st | 1975.0 | 396-10 | ... | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | 42.372122 | -71.120916 | POINT (-71.120916012646 42.3721215961329) |
| 4 | B1582 | 2015 | 14-47 | Yes | Non-Residential | 356882.0 | 1.0 | 500 Kendall Street | 2002.0 | 620-2 | ... | 58450497.8 | 170.0 | 167.3 | 2174.5 | 6.2 | 7472.3 | 21.39 | 42.363899 | -71.081408 | POINT (-71.0814075007659 42.3638994069543) |
5 rows × 46 columns
# Preview property database data
print('Preview property data')
property_data.head()
Preview property data
| PID | GISID | BldgNum | Address | Unit | Latitude | Longitude | StateClassCode | PropertyClass | Zoning | ... | Systems_Plumbing | Condition_YearBuilt | Condition_InteriorCondition | Condition_OverallCondition | Condition_OverallGrade | Parking_Open | Parking_Covered | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7344 | 112-47 | 1 | 36 Highland Ave | 9 | 42.3727 | -71.1059 | 102 | CONDOMINIUM | NaN | ... | NaN | 1940 | NaN | Average | Average | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 1 | 7345 | 112-47 | 1 | 36 Highland Ave | A | 42.3727 | -71.1059 | 102 | CONDOMINIUM | NaN | ... | NaN | 1940 | NaN | Average | Average | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 2 | 7346 | 112-47 | 1 | 36 Highland Ave | B | 42.3727 | -71.1059 | 102 | CONDOMINIUM | NaN | ... | NaN | 1940 | NaN | Average | Average | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 3 | 7347 | 112-47 | 1 | 36 Highland Ave | C | 42.3727 | -71.1059 | 102 | CONDOMINIUM | NaN | ... | NaN | 1940 | NaN | Average | Average | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 4 | 7348 | 112-47 | 1 | 36 Highland Ave | D | 42.3727 | -71.1059 | 102 | CONDOMINIUM | NaN | ... | NaN | 1940 | NaN | Average | Average | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
5 rows × 62 columns
A look at the first few rows of each dataset gives us some useful insights. There are some columns that will clearly be useless to our model, such as Address. Some columns are shared between the two datasets and could have potential while merging, like Latitude and Longitude. We can see a number of missing values in both datasets in just these first rows. The Reporting ID and PID of the two datests, their primary ID columns, don't seem to match up in either format or value, so we'll need to merge on something else. MapLot and GISID seem to have the same format, so they or similar columns may have promise.
4. Initial exploratory Data Analysis (EDA) and data cleaning¶
# As an example, inspect the rows for one particular address
energy_data[energy_data.Address == "130 Brookline Street"]
| Reporting ID | Data Year | MapLot | Annual Report Received | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Address | Year Built | Buildings Included | ... | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1074 | 2015 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | NaN | 801.1 | NaN | 1432.9 | 26.4 | 4754.0 | 87.64 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 1186 | B1074 | 2019 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | 47614895.9 | 877.4 | 877.8 | 1408.5 | 26.0 | 2737.9 | 50.48 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 1887 | B1074 | 2018 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | 44994851.0 | 831.1 | 829.5 | 1389.2 | 25.6 | 2429.9 | 44.80 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 3757 | B1074 | 2016 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | NaN | 794.7 | NaN | 1388.9 | 25.6 | 2620.0 | 48.30 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 4223 | B1074 | 2021 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | 46337321.4 | 859.4 | 854.3 | 1442.8 | 26.6 | 3054.4 | 56.31 | 42.360314 | -71.104513 | POINT (-71.1045128792698 42.3603142745612) |
| 4224 | B1074 | 2022 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | 46557552.1 | 861.8 | 858.3 | 1440.1 | 26.5 | 3023.5 | 55.74 | 42.360314 | -71.104513 | POINT (-71.1045128792698 42.3603142745612) |
| 4752 | B1074 | 2017 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | NaN | 767.2 | NaN | 1351.0 | 24.9 | 2284.8 | 42.12 | 42.360313 | -71.104509 | POINT (-71.1045085999411 42.3603129493224) |
| 5807 | B1074 | 2020 | 95-73 | Yes | Non-Residential | 44720.0 | 1.0 | 130 Brookline Street | 2010.0 | 694-4 | ... | 47797902.6 | 862.7 | 881.2 | 1432.8 | 26.4 | 2899.4 | 53.45 | 42.360314 | -71.104513 | POINT (-71.1045128792698 42.3603142745612) |
8 rows × 46 columns
Further column inspection reveals that we need to merge the two datasets on Data Year and MapLot, which equate to YearOfAssessment and GISID, respectively, in the property dataset. There will be multiple rows in the property dataset for every row in the energy use dataset (for those rows where matches can be found).
We plan to use weather-normalized site energy use intensity as our response variable, but since we may be able to impute that from Site EUI, we will drop rows where Site EUI is null rather than rows where weather-normalized site EUI is null. (Site EUI is non-null for all rows where Weather-normalized Site EUI is non-null.)
We also convert the year columns to datetime to ensure there are no formatting issues later.
# Remove rows with missing values in response variable
energy_data_clean = energy_data.dropna(subset=['Site EUI (kBtu/ft2)']) #dropna(subset=['Data Year', 'BEUDO Category', 'Source EUI (kBtu/ft2)', 'Total GHG Emissions (Metric Tons CO2e)'])
# Convert 'Data Year' to datetime
energy_data_clean = energy_data_clean.copy() # To handle an irrelevant warning
energy_data_clean['Data Year'] = pd.to_datetime(energy_data_clean['Data Year'], format='%Y')
energy_data_clean = energy_data_clean.copy() # To handle an irrelevant warning
energy_data_clean['Year Built'] = pd.to_datetime(energy_data_clean['Year Built'], format='%Y')
property_data['YearOfAssessment'] = pd.to_datetime(property_data['YearOfAssessment'], format='%Y')
# Merge datasets on Maplot and Year combination.
merged_data = pd.merge(energy_data_clean, property_data, left_on=['MapLot', 'Data Year'], right_on=['GISID', 'YearOfAssessment'], how='inner')
# Display basic information about the cleaned dataset
print(merged_data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 69059 entries, 0 to 69058 Columns: 108 entries, Reporting ID to FinishedBasementGross dtypes: bool(1), datetime64[ns](3), float64(52), int64(8), object(44) memory usage: 56.4+ MB None
The merged dataset contains 108 columns and 69,059 observations. The property use dataset must have many rows that don't match up with the properties included in the energy use dataset. This isn't overly surprising: The energy use dataset has selective rules for what types of properties are required to submit BEUDO reporting.
Now that we have our merged dataset, we can take a closer look at what information is available (and possibly duplicated) for each observation.
# For better ability to inspect columns, we remove the pandas column display maximum
pd.set_option('display.max_columns', None)
# Inspect the full set of columns for a subset of the dataset
merged_data.iloc[20:30,]
| Reporting ID | Data Year | MapLot | Annual Report Received | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Address_x | Year Built | Buildings Included | Buildings Included Count | Primary Property Type - Self Selected | All Property Uses | Property GFA - Self Reported (ft2) | Reported Residential Units | Owner | Owner Line 2 | ENERGY STAR Score | Electricity Use - Grid Purchase (kWh) | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (therms) | Natural Gas Use (kBtu) | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | Latitude_x | Longitude_x | Location | PID | GISID | BldgNum | Address_y | Unit | Latitude_y | Longitude_y | StateClassCode | PropertyClass | Zoning | Map/Lot | LandArea | YearOfAssessment | TaxDistrict | ResidentialExemption | BuildingValue | LandValue | AssessedValue | SalePrice | Book/Page | SaleDate | PreviousAssessedValue | Owner_Name | Owner_CoOwnerName | Owner_Address | Owner_Address2 | Owner_City | Owner_State | Owner_Zip | Exterior_Style | Exterior_occupancy | Exterior_NumStories | Exterior_WallType | Exterior_WallHeight | Exterior_RoofType | Exterior_RoofMaterial | Exterior_FloorLocation | Exterior_View | Interior_LivingArea | Interior_NumUnits | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_FullBaths | Interior_HalfBaths | Interior_Fireplaces | Interior_Flooring | Interior_Layout | Interior_LaundryInUnit | Systems_HeatType | Systems_HeatFuel | Systems_CentralAir | Systems_Plumbing | Condition_YearBuilt | Condition_InteriorCondition | Condition_OverallCondition | Condition_OverallGrade | Parking_Open | Parking_Covered | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | B1360 | 2017-01-01 | 157-69 | Yes | Non-Residential | 1462119.0 | 18.0 | 10 Everett Street | 2010-01-01 | 253-6 | 1 | Parking | Parking (223000.0) | 223000.0 | NaN | PRESIDENT & FELLOWS OF HARVARD COLLEGE | NaN | NaN | 549828.4 | 1876014.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 42.379312 | -71.119269 | POINT (-71.1192692525073 42.3793118769153) | 12589 | 157-69 | 20 | 1525-1587 Massachusetts Ave | NaN | 42.3791 | -71.1197 | 942 | Private College, University | C-3 | 157-69 | 2175997 | 2017-01-01 | C7 | False | 611230200.0 | 212377400.0 | 8.236076e+08 | 0.0 | / | 01/01/1900 | 7.665125e+08 | PRESIDENT & FELLOWS OF HARVARD COLLEGE | C/O HARVARD REAL ESTATE, INC. | HOLYOKE CENTER,ROOM 1000 | 1350 MASSACHUSETTS AVE | CAMBRIDGE | MA | 02138-3895 | MULTIPURP COLL | Private College, University | 2.0 | BRICK | 12.0 | NaN | TAR-GRAVEL | NaN | NaN | 17872 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | Other | NaN | GOOD | 1965 | NaN | Very Good | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 21 | B1360 | 2017-01-01 | 157-69 | Yes | Non-Residential | 1462119.0 | 18.0 | 10 Everett Street | 2010-01-01 | 253-6 | 1 | Parking | Parking (223000.0) | 223000.0 | NaN | PRESIDENT & FELLOWS OF HARVARD COLLEGE | NaN | NaN | 549828.4 | 1876014.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 42.379312 | -71.119269 | POINT (-71.1192692525073 42.3793118769153) | 12589 | 157-69 | 21 | 1525-1587 Massachusetts Ave | NaN | 42.3791 | -71.1197 | 942 | Private College, University | C-3 | 157-69 | 2175997 | 2017-01-01 | C7 | False | 611230200.0 | 212377400.0 | 8.236076e+08 | 0.0 | / | 01/01/1900 | 7.665125e+08 | PRESIDENT & FELLOWS OF HARVARD COLLEGE | C/O HARVARD REAL ESTATE, INC. | HOLYOKE CENTER,ROOM 1000 | 1350 MASSACHUSETTS AVE | CAMBRIDGE | MA | 02138-3895 | MULTIPURP COLL | Private College, University | 2.0 | BRICK | 12.0 | NaN | TAR-GRAVEL | NaN | NaN | 22312 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | Other | NaN | GOOD | 1965 | NaN | Very Good | EXCELLENT | NaN | NaN | NaN | 5698 | NaN |
| 22 | B1360 | 2017-01-01 | 157-69 | Yes | Non-Residential | 1462119.0 | 18.0 | 10 Everett Street | 2010-01-01 | 253-6 | 1 | Parking | Parking (223000.0) | 223000.0 | NaN | PRESIDENT & FELLOWS OF HARVARD COLLEGE | NaN | NaN | 549828.4 | 1876014.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 42.379312 | -71.119269 | POINT (-71.1192692525073 42.3793118769153) | 12589 | 157-69 | 22 | 1525-1587 Massachusetts Ave | NaN | 42.3791 | -71.1197 | 942 | Private College, University | C-3 | 157-69 | 2175997 | 2017-01-01 | C7 | False | 611230200.0 | 212377400.0 | 8.236076e+08 | 0.0 | / | 01/01/1900 | 7.665125e+08 | PRESIDENT & FELLOWS OF HARVARD COLLEGE | C/O HARVARD REAL ESTATE, INC. | HOLYOKE CENTER,ROOM 1000 | 1350 MASSACHUSETTS AVE | CAMBRIDGE | MA | 02138-3895 | COLL CLASSRM | Private College, University | 5.0 | STONE-VENEER | 14.0 | NaN | RUBBER-MEMBRAN | NaN | NaN | 164367 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H.V.A.C. | Gas | NaN | EXCELLENT | 2011 | NaN | Excellent | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 23 | B1360 | 2017-01-01 | 157-69 | Yes | Non-Residential | 1462119.0 | 18.0 | 10 Everett Street | 2010-01-01 | 253-6 | 1 | Parking | Parking (223000.0) | 223000.0 | NaN | PRESIDENT & FELLOWS OF HARVARD COLLEGE | NaN | NaN | 549828.4 | 1876014.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 42.379312 | -71.119269 | POINT (-71.1192692525073 42.3793118769153) | 12589 | 157-69 | 23 | 1525-1587 Massachusetts Ave | NaN | 42.3791 | -71.1197 | 942 | Private College, University | C-3 | 157-69 | 2175997 | 2017-01-01 | C7 | False | 611230200.0 | 212377400.0 | 8.236076e+08 | 0.0 | / | 01/01/1900 | 7.665125e+08 | PRESIDENT & FELLOWS OF HARVARD COLLEGE | C/O HARVARD REAL ESTATE, INC. | HOLYOKE CENTER,ROOM 1000 | 1350 MASSACHUSETTS AVE | CAMBRIDGE | MA | 02138-3895 | MULTIPURP COLL | Private College, University | 4.0 | BRICK | 11.0 | NaN | ROLL-ROOFING | NaN | NaN | 80780 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | NaN | GOOD | 1980 | NaN | Very Good | GOOD-V-GOOD | NaN | NaN | NaN | 0 | NaN |
| 24 | B1360 | 2017-01-01 | 157-69 | Yes | Non-Residential | 1462119.0 | 18.0 | 10 Everett Street | 2010-01-01 | 253-6 | 1 | Parking | Parking (223000.0) | 223000.0 | NaN | PRESIDENT & FELLOWS OF HARVARD COLLEGE | NaN | NaN | 549828.4 | 1876014.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 42.379312 | -71.119269 | POINT (-71.1192692525073 42.3793118769153) | 12589 | 157-69 | 24 | 1525-1587 Massachusetts Ave | NaN | 42.3791 | -71.1197 | 942 | Private College, University | C-3 | 157-69 | 2175997 | 2017-01-01 | C7 | False | 611230200.0 | 212377400.0 | 8.236076e+08 | 0.0 | / | 01/01/1900 | 7.665125e+08 | PRESIDENT & FELLOWS OF HARVARD COLLEGE | C/O HARVARD REAL ESTATE, INC. | HOLYOKE CENTER,ROOM 1000 | 1350 MASSACHUSETTS AVE | CAMBRIDGE | MA | 02138-3895 | HISTORIC DORM | Private College, University | 4.0 | BRICK | 12.0 | NaN | SLATE-CLAY | NaN | NaN | 80792 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | NaN | AVERAGE | 1840 | NaN | Very Good | GOOD | NaN | NaN | NaN | 19598 | NaN |
| 25 | B1898 | 2017-01-01 | 18-75 | Yes | Municipal | 12571.0 | 0.0 | 59 & 69 First Street | 1993-01-01 | 505-2 | 1 | Restaurant | Restaurant (3215.0),Parking (697.0) | 3912.0 | NaN | CAMBRIDGE CITY OF | COMM. DEV. | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 42.369012 | -71.078103 | POINT (-71.0781030672861 42.3690115477904) | 751 | 18-75 | 1 | 51-69 First St | NaN | 42.3687 | -71.0775 | 931 | Improved City | PUD-4B-BA | 18-75 | 75738 | 2017-01-01 | C1 | False | 45570200.0 | 13587600.0 | 5.915780e+07 | 0.0 | / | 01/01/1900 | 5.399520e+07 | CAMBRIDGE CITY OF | COMM. DEV. | 57 INMAN ST | NaN | CAMBRIDGE | MA | 02139 | PARKG-STRUCT | Improved City | 6.0 | BRICK | 10.0 | NaN | OTHER | NaN | NaN | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | OTHER | Other | NaN | AVERAGE | 1987 | NaN | Excellent | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 26 | B1326 | 2019-01-01 | 52A-21 | Yes | Non-Residential | 3691096.0 | 0.0 | 125 MASSACHUSETTS AVE | 1928-01-01 | 668-33 | 1 | College/University | College/University (55046.0) | 55046.0 | NaN | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | NaN | NaN | 559027.9 | 1907403.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1154460.1 | 3708563.7 | NaN | 6770427.4 | 6713394.5 | 123.0 | 122.0 | 10852843.2 | 10840848.1 | 197.2 | 196.9 | 432.3 | 7.9 | 286.5 | 5.2 | 42.360179 | -71.093746 | POINT (-71.0937455339908 42.3601790096898) | 21897 | 52A-21 | 1 | 60 Vassar St | NaN | 42.3591 | -71.0933 | 942 | Private College, University | C-3 | 52A-21 | 1835526 | 2019-01-01 | C3 | False | 982713000.0 | 397575100.0 | 1.380288e+09 | 375000.0 | 906/ 350 | 09/04/1957 | 1.273895e+09 | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | C/O MIT INVESTMENTS MANAGEMENT CO | ONE BROADWAY, SUITE 09-200 | NaN | CAMBRIDGE | MA | 02142 | MULTIPURP COLL | Private College, University | 4.0 | BRICK | 12.0 | NaN | METAL | NaN | NaN | 113760 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | NaN | AVERAGE | 1913 | NaN | Average | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 27 | B1326 | 2019-01-01 | 52A-21 | Yes | Non-Residential | 3691096.0 | 0.0 | 125 MASSACHUSETTS AVE | 1928-01-01 | 668-33 | 1 | College/University | College/University (55046.0) | 55046.0 | NaN | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | NaN | NaN | 559027.9 | 1907403.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1154460.1 | 3708563.7 | NaN | 6770427.4 | 6713394.5 | 123.0 | 122.0 | 10852843.2 | 10840848.1 | 197.2 | 196.9 | 432.3 | 7.9 | 286.5 | 5.2 | 42.360179 | -71.093746 | POINT (-71.0937455339908 42.3601790096898) | 21897 | 52A-21 | 2 | 60 Vassar St | NaN | 42.3591 | -71.0933 | 942 | Private College, University | C-3 | 52A-21 | 1835526 | 2019-01-01 | C3 | False | 982713000.0 | 397575100.0 | 1.380288e+09 | 375000.0 | 906/ 350 | 09/04/1957 | 1.273895e+09 | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | C/O MIT INVESTMENTS MANAGEMENT CO | ONE BROADWAY, SUITE 09-200 | NaN | CAMBRIDGE | MA | 02142 | MULTIPURP COLL | Private College, University | 4.0 | STUCCO | 12.0 | NaN | METAL | NaN | NaN | 118836 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | NaN | AVERAGE | 1913 | NaN | Excellent | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 28 | B1326 | 2019-01-01 | 52A-21 | Yes | Non-Residential | 3691096.0 | 0.0 | 125 MASSACHUSETTS AVE | 1928-01-01 | 668-33 | 1 | College/University | College/University (55046.0) | 55046.0 | NaN | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | NaN | NaN | 559027.9 | 1907403.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1154460.1 | 3708563.7 | NaN | 6770427.4 | 6713394.5 | 123.0 | 122.0 | 10852843.2 | 10840848.1 | 197.2 | 196.9 | 432.3 | 7.9 | 286.5 | 5.2 | 42.360179 | -71.093746 | POINT (-71.0937455339908 42.3601790096898) | 21897 | 52A-21 | 3 | 60 Vassar St | NaN | 42.3591 | -71.0933 | 942 | Private College, University | C-3 | 52A-21 | 1835526 | 2019-01-01 | C3 | False | 982713000.0 | 397575100.0 | 1.380288e+09 | 375000.0 | 906/ 350 | 09/04/1957 | 1.273895e+09 | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | C/O MIT INVESTMENTS MANAGEMENT CO | ONE BROADWAY, SUITE 09-200 | NaN | CAMBRIDGE | MA | 02142 | MULTIPURP COLL | Private College, University | 5.0 | STUCCO | 12.0 | NaN | METAL | NaN | NaN | 165310 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | NaN | AVERAGE | 1913 | NaN | Good | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
| 29 | B1326 | 2019-01-01 | 52A-21 | Yes | Non-Residential | 3691096.0 | 0.0 | 125 MASSACHUSETTS AVE | 1928-01-01 | 668-33 | 1 | College/University | College/University (55046.0) | 55046.0 | NaN | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | NaN | NaN | 559027.9 | 1907403.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1154460.1 | 3708563.7 | NaN | 6770427.4 | 6713394.5 | 123.0 | 122.0 | 10852843.2 | 10840848.1 | 197.2 | 196.9 | 432.3 | 7.9 | 286.5 | 5.2 | 42.360179 | -71.093746 | POINT (-71.0937455339908 42.3601790096898) | 21897 | 52A-21 | 4 | 60 Vassar St | NaN | 42.3591 | -71.0933 | 942 | Private College, University | C-3 | 52A-21 | 1835526 | 2019-01-01 | C3 | False | 982713000.0 | 397575100.0 | 1.380288e+09 | 375000.0 | 906/ 350 | 09/04/1957 | 1.273895e+09 | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | C/O MIT INVESTMENTS MANAGEMENT CO | ONE BROADWAY, SUITE 09-200 | NaN | CAMBRIDGE | MA | 02142 | MULTIPURP COLL | Private College, University | 5.0 | STUCCO | 12.0 | NaN | METAL | NaN | NaN | 160355 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | NaN | AVERAGE | 1913 | NaN | Good | EXCELLENT | NaN | NaN | NaN | 0 | NaN |
Looking at the first few rows, we see that several buildings are part of the same MapLot and reporting ID. The energy use dataset aggregates them, the property dataset does not. This means the energy use data is duplicated across these rows. The property dataset has useful columns such as exterior wall type, number of stories, etc., things that would be helpful in predicting and evaluating energy usage.
The question: How to relate the energy usage to these buildings? We can't treat each building as using the full amount of energy from the whole map lot. Can we aggregate somehow? Take the mode or mean of the descriptive columns (mode if categorical, mean if ordinal)? Or perhaps we divide energy use proportionally by Interior_LivingArea?
Looking more closely, the energy data in those first rows seems to only include two of the three buildings. Judging by the total living area vs the individual living areas of the buildings, it looks like the gym isn't included. Too small to be counted? Unknown.
Let's explore the data more closely. We'll start by taking a look at the distributions of some of our columns across the BEUDO categories.
# Distribution of buildings across BEUDO categories
plt.figure(figsize=(8, 4))
sns.countplot(data=merged_data, x='BEUDO Category')
plt.title('Distribution of Buildings Across BEUDO Categories')
plt.xticks(rotation=45)
plt.show()
# Average Source EUI by BEUDO Category
plt.figure(figsize=(8, 4))
sns.barplot(data=merged_data, x='BEUDO Category', y='Site EUI (kBtu/ft2)')
plt.title('Average Site EUI by BEUDO Category')
plt.xticks(rotation=45)
plt.show()
- Distribution of Buildings Across BEUDO Categories
- Observation: The vast majority of buildings are categorized as Residential and Non-Residential. The Municipal category has almost no buildings.
- Implication: Municipal buildings have very little data to back them. We can still try to model them, but it's a low priority. For better consistency in column availability, we can start by focusing on the Residential observations. Non-Residential observations are likely to use fewer of the predictive columns than Residential (they're less likely to have bedroom values, for instance), so getting good results is likely to be simultaneously more difficult and less complex.
- Average Site EUI (Energy Use Intensity) by BEUDO Category
- Observation: Non-Residential buildings have the highest average Site EUI, followed by Municipal buildings, with Residential buildings having the lowest.
- Implication: Non-Residential buildings consume more energy per square foot compared to other categories. This suggests that focusing on those buildings for energy efficiency improvements could yield significant reductions in overall energy consumption. Municipal buildings also present an opportunity for improvement.
# Weather-normalized energy use trend over time
plt.figure(figsize=(12, 6))
sns.lineplot(data=merged_data, x='Data Year', y='Weather Normalized Site EUI (kBtu/ft2)', hue='BEUDO Category')
plt.title(' Weather-Normalized Energy Use Intensity Trend Over Time by BEUDO Category')
plt.show()
- Weather-Normalized Energy Use Intensity Trend Over Time by BEUDO Category
As a continuation of the above, this is interesting. The Residential and Municipal trends are similar, with peaks in 2018 and 2021. The Non-Residential trend, meanwhile, changes dramatically. It now peaks in 2019 after a much lower preceding three years. It drops again in 2020, then starts trending upward.
The pattern from 2019 onward does stay mostly consistent between normalized and unnormalized EUI. Perhaps the drop in Non-Residential in 2020 is linked to the COVID pandemic, and as conditions improved business recovered?
One way or another, we can see that year will be a major predictor for Non-Residential data. Whether the trend assigned to it is real or not is unclear. The energy use certainly varies dramatically, but the buildings presumably aren't changing nearly so frequently.
# Inspect distribution of log of weather normalized site EUI
Log_EUI = np.log1p(merged_data['Weather Normalized Site EUI (kBtu/ft2)'])
sns.histplot(data=merged_data, x=Log_EUI, bins=30, kde=True, color='lightblue', alpha=0.7)
# Add labels and title
plt.title("Distribution of Log of Weather Normalized Site EUI")
plt.xlabel("ln(Weather Normalized Site EUI (kBtu/ft2))")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
- Log of Weather-Normalized Site EUI Distribution
- Observation: The log-transformed weather-normalized Site EUI closely follows a normal distribution. The majority of values cluster around 40,000, with a noticeable left skew at the lower end. The log-transformed weather-normalized site EUI should serve us well as a response variable no matter what model we use.
# Plot missing values
msno.matrix(merged_data, labels=True, label_rotation=90, fontsize=10)
plt.title('Missing values')
plt.show()
- Missing values
We can see some trends in missing values. The large white block toward the left is the individual fuel use columns, several of which have no entries at all. Those technically aren't missing, the null values just indicate that none of that fuel was used. The columns after that are the overall energy use columns, and we can see that rows that lack a value in one weather-normalized column appear to lack values in all weather-normalized columns.
There are columns in other areas that are concerningly sparse. Exterior_RoofType appears to be almost completely missing, as are Systems_CentralAir and Condition_Interior.
Comparing across the rows, it looks like rows with Unit values (presumably representing individual residential units) are the only ones with values for most of the Interior_ columns, while those without Unit values are the only ones with Exterior_ values. The Unit rows are also the ones with Parking_ and FinishedBasementGross values.
We will need to figure out how to handle these groupings separately.
# Plot distributions of all the variables
#fig, axes = plt.subplots(ncols=5, nrows=22, figsize=(30,110))
#axes = axes.flatten()
#for i,var in enumerate(merged_data.columns):
# if merged_data[var].dtype == 'float64':
# sns.histplot(merged_data, x=var, ax=axes[i])
# else:
# sns.countplot(merged_data, x=var, ax=axes[i])
a. Data Inspection and Cleaning¶
# Drop extraneous and irrelevant columns
bad_metadata = ['MapLot', 'Address_x', 'Address_y', 'Buildings Included', 'All Property Uses', 'Owner', 'Owner Line 2',
'Electricity Use - Grid Purchase (kWh)', 'Natural Gas Use (therms)', 'Latitude_x', 'Latitude_y', 'Longitude_x', 'Longitude_y',
'Location', 'PID', 'GISID', 'StateClassCode', 'PropertyClass', 'Zoning', 'Map/Lot', 'YearOfAssessment', 'TaxDistrict', 'ResidentialExemption',
'BuildingValue', 'LandValue', 'AssessedValue', 'SalePrice', 'Book/Page', 'SaleDate', 'PreviousAssessedValue', 'Owner_Name',
'Owner_CoOwnerName', 'Owner_Address', 'Owner_Address2', 'Owner_City', 'Owner_State', 'Owner_Zip']
merged_data = merged_data.drop(bad_metadata, axis=1)
We'll leave some questionable metadata columns in for now. Specifically, Annual Report Received.
Buildings Included is probably counterproductive for prediction, but it's useful for other comparisons. We'll drop it later.
All Property Uses could theoretically be turned into a mix of categorical values. Perhaps make a set of columns indicating amount of square feet dedicated to each of a number of common uses? We'll drop it for now, but we might be able to make use of it later.
It occurs to us that pretty much every energy use column in the energy dataset isn't usable for this. They just sum up to site energy use. They're the result, not the cause. Though we could potentially use them to create "uses X energy source" columns...
We'll set those ideas aside for the moment. For now, we have typos to fix and cleaning to do.
# Fix typos in categorical columns
merged_data.loc[merged_data.Systems_HeatType == 'Hot Water', 'Systems_HeatType'] = 'HOT-WATER'
merged_data.loc[merged_data.Systems_HeatType == 'Heat Pump', 'Systems_HeatType'] = 'HEAT-PUMP'
merged_data.loc[merged_data.Systems_HeatType == 'H.V.A.C', 'Systems_HeatType'] = 'H.V.A.C.'
merged_data.loc[merged_data.Systems_HeatType == 'Other', 'Systems_HeatType'] = 'OTHER'
merged_data.loc[merged_data.Systems_HeatType == 'Space Heat', 'Systems_HeatType'] = 'SPACE-HEAT'
merged_data.loc[merged_data.Systems_HeatType == 'Steam', 'Systems_HeatType'] = 'STEAM'
merged_data.loc[merged_data.Systems_HeatType == 'HW Radiator', 'Systems_HeatType'] = 'HOT-WATR-RAD'
merged_data.loc[merged_data.Systems_HeatType == 'Forced Air', 'Systems_HeatType'] = 'FORCED-AIR'
merged_data.loc[merged_data.Systems_HeatType == 'Elec Wall', 'Systems_HeatType'] = 'ELEC-WALL'
merged_data.loc[merged_data.Systems_HeatType == 'Elec-Radiant', 'Systems_HeatType'] = 'ELEC-RADIANT'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Good', 'Condition_OverallGrade'] = 'GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Good Very Good', 'Condition_OverallGrade'] = 'GOOD VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Average', 'Condition_OverallGrade'] = 'AVERAGE'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Excellent', 'Condition_OverallGrade'] = 'EXCELLENT'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Fair', 'Condition_OverallGrade'] = 'FAIR'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'GOOD-V-GOOD', 'Condition_OverallGrade'] = 'GOOD VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'VERY-GOOD', 'Condition_OverallGrade'] = 'VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'VERY-GOOD-EX', 'Condition_OverallGrade'] = 'VERY GOOD EXCELLENT'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'VERY-GOOD', 'Condition_OverallGrade'] = 'VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Good Very Good', 'Condition_OverallGrade'] = 'GOOD VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Very Good', 'Condition_OverallGrade'] = 'VERY GOOD'
merged_data.loc[merged_data['Condition_OverallGrade'] == 'Poor', 'Condition_OverallGrade'] = 'POOR'
# Check the options for interior kitchens
merged_data.Interior_Kitchens.value_counts()
Interior_Kitchens Pull Alcove 15795 Galley 8171 Full Eat In 5158 2 94 6 84 0 23 1 2 3 1 Name: count, dtype: int64
We can't weight the kitchens by interior living space. One kitchen is still one kitchen, no matter how much space the living area takes up. With the variety of options available, we also can't make use of the three descriptive responses. We doubt the type is important to energy consumption anyway. Instead, we'll convert to counts. Pull Alcove, Galley, and Full Eat In will all be translated to 1 and then the column will be treated as numeric. We'll sum it when aggregating to property.
# Check the options for interior flooring and layout
print(merged_data.Interior_Flooring.value_counts(dropna=False))
merged_data.Interior_Layout.value_counts(dropna=False)
Interior_Flooring NaN 39981 Hardwood 25345 Carpet 3299 Vinyl/Linoleum 240 Softwood 123 Cermc/Qry Tile 57 Marble 14 Name: count, dtype: int64
Interior_Layout NaN 38187 No Impact 19543 Corner Unit 7419 Thru Unit 3572 Poor 305 Superior 33 Name: count, dtype: int64
These descriptive columns are good, at least. There are plenty of missing values, but the categories themselves are coherent. Not ordered, though; we'll need to one-hot encode them, and in this case we'll need to keep all of the categories. The default will be "None of the above", i.e. either N/A or missing. (We'll want to do something about ones that are missing that shouldn't be.)
# Check whether there are any rows without annual report received
merged_data['Annual Report Received'].value_counts(dropna=False)
Annual Report Received Yes 69056 No 3 Name: count, dtype: int64
# Look at the few rows where annual report isn't received
merged_data[merged_data['Annual Report Received'] == 'No']
| Reporting ID | Data Year | Annual Report Received | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Year Built | Buildings Included Count | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Reported Residential Units | ENERGY STAR Score | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (kBtu) | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | BldgNum | Unit | LandArea | Exterior_Style | Exterior_occupancy | Exterior_NumStories | Exterior_WallType | Exterior_WallHeight | Exterior_RoofType | Exterior_RoofMaterial | Exterior_FloorLocation | Exterior_View | Interior_LivingArea | Interior_NumUnits | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_FullBaths | Interior_HalfBaths | Interior_Fireplaces | Interior_Flooring | Interior_Layout | Interior_LaundryInUnit | Systems_HeatType | Systems_HeatFuel | Systems_CentralAir | Systems_Plumbing | Condition_YearBuilt | Condition_InteriorCondition | Condition_OverallCondition | Condition_OverallGrade | Parking_Open | Parking_Covered | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47499 | B1139 | 2018-01-01 | No | Non-Residential | 182389.0 | NaN | 2017-01-01 | 1 | College/University | 102899.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24916.0 | 210575.5 | NaN | 2.0 | NaN | 436587.3 | NaN | 4.2 | NaN | 14.8 | 0.1 | NaN | NaN | 1 | NaN | 126654 | MULTIPURP COLL | Private College, University | 4.0 | BRICK | 10.0 | NaN | RUBBER-MEMBRAN | NaN | NaN | 37060 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H.V.A.C. | Electric | NaN | AVERAGE | 2016 | NaN | Excellent | GOOD VERY GOOD | NaN | NaN | NaN | 9265 | NaN |
| 52417 | B1241 | 2021-01-01 | No | Residential | 62875.0 | NaN | 1893-01-01 | 1 | Residence Hall/Dormitory | 58818.0 | NaN | 94.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2397530.1 | NaN | 3265869.8 | NaN | 55.5 | NaN | 3674586.4 | NaN | 62.5 | NaN | 205.0 | 3.5 | 90.9 | 1.55 | 1 | NaN | 14347 | HISTORIC DORM | Private College, University | 5.0 | BRICK | 11.0 | NaN | ASPHALT-SHNG | NaN | NaN | 62875 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | NaN | AVERAGE | 1890 | NaN | Good | GOOD | NaN | NaN | NaN | 9942 | NaN |
| 52530 | B1241 | 2020-01-01 | No | Residential | 62875.0 | NaN | 1893-01-01 | 1 | Residence Hall/Dormitory | 58818.0 | NaN | 95.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2958509.9 | NaN | 2958509.9 | 3249638.6 | 50.3 | 55.2 | 3559209.0 | 3909448.8 | 60.5 | 66.5 | 196.4 | 3.3 | 0.0 | 0.00 | 1 | NaN | 14347 | HISTORIC DORM | Private College, University | 5.0 | BRICK | 11.0 | NaN | ASPHALT-SHNG | NaN | NaN | 62875 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | NaN | AVERAGE | 1890 | NaN | Good | GOOD | NaN | NaN | NaN | 9942 | NaN |
"
Describes whether the City received an energy use report for the building. "Yes" values indicate reports that include values for grid-purchased electricity or natural gas use. "No" values indicate that no report was received, or the report did not include electricity or natural gas use.
"
It's best if we drop these. There are only three, and we don't know how site energy use and the like were arrived at if these details were left out.
# Drop rows where no annual report was received
merged_data = merged_data[merged_data['Annual Report Received'] == 'Yes']
# Drop the Annual Report Received column
merged_data = merged_data.drop(['Annual Report Received'], axis=1)
We'll check roof types and materials for category issues next.
# Check roof types and roof materials
print(merged_data.Exterior_RoofType.value_counts(dropna=False))
merged_data.Exterior_RoofMaterial.value_counts(dropna=False)
Exterior_RoofType NaN 68866 Gable 95 Flat 86 Mansard 9 Name: count, dtype: int64
Exterior_RoofMaterial NaN 32200 SLATE-CLAY 11284 TAR-GRAVEL 10550 RUBBER-MEMBRAN 3785 METAL 3635 ROLL-ROOFING 3047 ASPHALT-SHNG 2623 OTHER 1173 WOOD-SHINGLE 218 TILE 215 COMPOSITION 136 Aspahlt Shingl 103 Rubber Membrai 87 Name: count, dtype: int64
There are a couple of clear typos in the latter category set. We'll fix those and unify them with the main categories of the same name.
# Fix the roof materials with typos
merged_data.loc[merged_data['Exterior_RoofMaterial'] == 'Aspahlt Shingl', 'Exterior_RoofMaterial'] = 'ASPHALT-SHNG'
merged_data.loc[merged_data['Exterior_RoofMaterial'] == 'Rubber Membrai', 'Exterior_RoofMaterial'] = 'RUBBER-MEMBRAN'
merged_data.Exterior_RoofMaterial.value_counts(dropna=False)
Exterior_RoofMaterial NaN 32200 SLATE-CLAY 11284 TAR-GRAVEL 10550 RUBBER-MEMBRAN 3872 METAL 3635 ROLL-ROOFING 3047 ASPHALT-SHNG 2726 OTHER 1173 WOOD-SHINGLE 218 TILE 215 COMPOSITION 136 Name: count, dtype: int64
It looks like roof type isn't going to be useful to us. Not nearly enough observations report values for it. Roof material is more promising, though that's still a lot of rows with no response.
Next, exterior wall type.
# Check exterior wall type for missing and typos
merged_data.Exterior_WallType.value_counts(dropna=False)
Exterior_WallType NaN 32113 BRICK 21054 OTHER 5115 WOOD-SHN-SHK 2757 STUCCO 2568 FRAME-CLAPBD 1498 CONCR-BLOCK 1004 CONCR PANEL 770 STONE-VENEER 748 BRICK-VENEER 721 METAL-GLASS 388 Frame-Clapbrd 188 METAL 113 TILT PANEL 12 ASPHALT-SHNG 3 ASBSTOS-SHNG 2 Brick 2 Name: count, dtype: int64
# Fix wall type typos
merged_data.loc[merged_data['Exterior_WallType'] == 'Frame-Clapbrd', 'Exterior_WallType'] = 'FRAME-CLAPBD'
merged_data.loc[merged_data['Exterior_WallType'] == 'Brick', 'Exterior_WallType'] = 'BRICK'
merged_data.Exterior_WallType.value_counts(dropna=False)
Exterior_WallType NaN 32113 BRICK 21056 OTHER 5115 WOOD-SHN-SHK 2757 STUCCO 2568 FRAME-CLAPBD 1686 CONCR-BLOCK 1004 CONCR PANEL 770 STONE-VENEER 748 BRICK-VENEER 721 METAL-GLASS 388 METAL 113 TILT PANEL 12 ASPHALT-SHNG 3 ASBSTOS-SHNG 2 Name: count, dtype: int64
# Check whether rows with RoofMaterial also have WallType
merged_data.loc[merged_data.Exterior_RoofMaterial.notna(), 'Exterior_WallType'].value_counts(dropna=False)
Exterior_WallType BRICK 21029 OTHER 5115 WOOD-SHN-SHK 2757 STUCCO 2568 FRAME-CLAPBD 1650 CONCR-BLOCK 1002 CONCR PANEL 770 STONE-VENEER 738 BRICK-VENEER 721 METAL-GLASS 378 METAL 113 TILT PANEL 12 ASBSTOS-SHNG 2 NaN 1 Name: count, dtype: int64
So. If we keep only the rows where we have roof material data, we also drop all but one of the rows with no reported exterior wall type. This tracks with what we saw with the earlier missing data chart. For the moment, restricting ourselves to only the rows where we have usable data for such things may be the way to go. We can try to figure out what to do with the rest of the data later.
# Check distribution of exterior style
pd.set_option("display.max_rows", None)
merged_data.Exterior_Style.value_counts(dropna=False)
Exterior_Style NaN 17064 APARTMENT 13154 MULTIPURP COLL 7463 HISTORIC INSTITUTNL 5162 HISTORIC DORM 3864 LAB CLASSRM 3853 DORMITORY 2837 COLL CLASSRM 1827 APARTMENTS 1751 ELEM SCHOOL 1725 LIBRARY 1691 LIFE SCI HI TECH 1451 OFFICE 1013 STORAGE-WHSE 886 GYMNASIUM 517 TWO-FLOORS 392 TOWNHOUSE 381 PARKG-STRUCT 371 GEN-OFFICE 343 FLAT 328 HISTORIC CHURCH 248 COLL CAFETERIA 233 OTHER 232 CHURCH 219 APTS-LUXURY 163 MULT-RESDNCE 157 INDUST-MFCTR 146 R & D 145 Commercial Condo 109 RETAIL-STORE 94 DUPLEX 94 TOWNHSE-END 93 BASEMENT 92 HOTEL 92 MUN. GOVT BLDG 76 GYM 73 REC POOL BLDG 70 TOWNHSE-TYPE 66 COM-GRGE-SVC 62 MUSEUM 62 MIX NBHD CTR W/COM 46 FIREHOUSE 35 HOSPITAL 34 GRG-UNDERGND 31 CLEAN MFG 27 AUDITORIUM 26 LIFE SCIENCE LAB 25 INDUSTRL CONDO 22 WATER PLANT 21 RESTAURANT 18 ASTD LIVING 14 CLUB-HOUSE 14 HIGH SCHOOL 14 CINEMA 14 LIVE THEATRE 12 COM SHOP CTR 11 PARKING 11 RSRCH IND CND 10 MEDICAL-OFC 8 BANK 7 DAY-CARE-CTR 7 MARKET 7 AUTO-SHWROOM 7 DEPT-STORE 7 MIX NBHD CTR W/RES 6 BOATHOUSE 6 DISTRIB-WHSE 6 RETIRE-HOME 5 CONVENTIONAL 3 FREE-STANDING 2 POST-OFFICE 1 Name: count, dtype: int64
That's 71 different exterior styles, all of which are less "style" and more purpose. Theoretically interesting but very hard to work with. We could take the first few as categories and then call the rest as "other", but that's a lot of other, and there's no reason to believe the "other" group is even remotely unified. With work and time, we could do manual consolidation in sensible groups, but we don't have that time. Perhaps keep the groups with more than 500 observations and set all the others to "Other"? Something to consider later. For now, we'll drop this column as unusable.
# Drop Exterior_Style
merged_data = merged_data.drop('Exterior_Style', axis = 1)
# Check Exterior_occupancy for the same reasons
merged_data.Exterior_occupancy.value_counts(dropna=False)
Exterior_occupancy NaN 31957 Private College, University 30495 Private Elementary Education 1658 >8-UNIT-APT 1166 GEN-OFFICE 437 Housing Authority 436 HIGH-TECH 433 INV-OFFICE 348 Improved City 234 RES-&-DEV-FC 135 Improved Local Edu 130 Church 124 RETAIL-STORE 115 BOARDING-HSE 112 RETAIL-OFFIC 111 Private Secondary Education 107 HOTEL 103 MULT-RES-2FAM 94 SNGL-FAM-RES 93 WAREHOUSE 81 PARKING-GAR 66 Improved Public Safety 63 CLEAN-MANUF 56 MXD >8-UNIT-APT 50 MANUFACTURNG 47 AFFORDABLE APT 45 Charitable Svc 41 4-8-UNIT-APT 39 MULTIUSE-RES 33 Utility Authority 28 SH-CNTR/MALL 27 MULT-RES-4-8-APT 24 ASSISTED-LIV 21 EATING-ESTBL 18 Other- Scientific 14 Hospitals 14 TENNIS-CLUB 14 US Government 8 AUTO-REPAIR 7 DCR- State Parks and Rec 7 TELE-EXCH-STA 7 AUTO-SALES 7 MULTIUSE-RES-COM 7 MEDICAL-OFFC 7 121 Corporation 6 MXD RETAIL-STORE 5 MXD RETAIL-OFFIC 5 INN-RESORT 4 MXD SH-CNTR/MALL 4 NURSING-HOME 2 MXD GEN-OFFICE 2 MULT-RES-1FAM 2 MULT-RES->8 APT 2 TWO-FAM-RES 2 BANK 1 MXD HIGH-TECH 1 THREE-FM-RES 1 Name: count, dtype: int64
# Check variety of exterior occupancy for just residential
merged_data.loc[merged_data['BEUDO Category'] == 'Residential'].Exterior_occupancy.value_counts(dropna=False)
Exterior_occupancy NaN 27797 >8-UNIT-APT 1154 Private College, University 517 Housing Authority 289 BOARDING-HSE 112 MULT-RES-2FAM 94 MXD >8-UNIT-APT 50 AFFORDABLE APT 45 4-8-UNIT-APT 39 MULTIUSE-RES 33 MULT-RES-4-8-APT 24 ASSISTED-LIV 21 Hospitals 14 SNGL-FAM-RES 9 RETAIL-STORE 9 MULTIUSE-RES-COM 7 121 Corporation 5 PARKING-GAR 2 GEN-OFFICE 2 MULT-RES-1FAM 2 MULT-RES->8 APT 2 TWO-FAM-RES 2 THREE-FM-RES 1 Name: count, dtype: int64
This column is even more sparse and less useful than Exterior_Style. We'll drop it for the same reasons.
# Drop exterior occupancy
merged_data = merged_data.drop('Exterior_occupancy', axis = 1)
That done, we'll check how many NAs remain in the columns.
# Count rows that include NA
merged_data.isna().sum(axis=0)
Reporting ID 0 Data Year 0 BEUDO Category 0 PD Parcel Living Area 0 PD Parcel Units 2715 Year Built 0 Buildings Included Count 0 Primary Property Type - Self Selected 0 Property GFA - Self Reported (ft2) 58 Reported Residential Units 35918 ENERGY STAR Score 26809 Electricity Use - Grid Purchase (kBtu) 1248 Natural Gas Use (kBtu) 14451 Fuel Oil #1 Use (kBtu) 69056 Fuel Oil #2 Use (kBtu) 68863 Fuel Oil #4 Use (kBtu) 69056 Fuel Oil #5 & 6 Use (kBtu) 69056 Diesel #2 Use (kBtu) 67161 Kerosene Use (kBtu) 69049 District Chilled Water Use (kBtu) 48395 District Steam Use (kBtu) 41108 Electricity Use - Generated From Onsite Renewable Systems (kWh) 68417 Site Energy Use (kBtu) 0 Weather Normalized Site Energy Use (kBtu) 6696 Site EUI (kBtu/ft2) 0 Weather Normalized Site EUI (kBtu/ft2) 6393 Source Energy Use (kBtu) 0 Weather Normalized Source Energy Use (kBtu) 6696 Source EUI (kBtu/ft2) 0 Weather Normalized Source EUI (kBtu/ft2) 6393 Total GHG Emissions (Metric Tons CO2e) 0 Total GHG Emissions Intensity (kgCO2e/ft2) 0 Water Use (All Water Sources) (kgal) 6700 Water Intensity (All Water Sources) (gal/ft2) 8375 BldgNum 0 Unit 37114 LandArea 0 Exterior_NumStories 0 Exterior_WallType 32113 Exterior_WallHeight 32146 Exterior_RoofType 68866 Exterior_RoofMaterial 32200 Exterior_FloorLocation 37958 Exterior_View 37114 Interior_LivingArea 0 Interior_NumUnits 31942 Interior_TotalRooms 36910 Interior_Bedrooms 36910 Interior_Kitchens 39728 Interior_FullBaths 36910 Interior_HalfBaths 36910 Interior_Fireplaces 36910 Interior_Flooring 39978 Interior_Layout 38184 Interior_LaundryInUnit 37114 Systems_HeatType 1716 Systems_HeatFuel 52203 Systems_CentralAir 68852 Systems_Plumbing 32199 Condition_YearBuilt 0 Condition_InteriorCondition 68866 Condition_OverallCondition 141 Condition_OverallGrade 17 Parking_Open 36910 Parking_Covered 36910 Parking_Garage 37114 UnfinishedBasementGross 0 FinishedBasementGross 36910 dtype: int64
As previously noted in the missingness chart, there are several columns with no values at all. We'll remove those. There are other columns where missing values clearly mean 0, so we'll fill those in.
# Drop energy use columns that are empty for all rows
merged_data = merged_data.drop(['Fuel Oil #1 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)', 'Fuel Oil #5 & 6 Use (kBtu)'], axis = 1)
# Fill NAs where NA means 0
# Assume non-residential rows without reported residential units have 0 reported residential units
merged_data.loc[(merged_data['Reported Residential Units'].isna()) & (merged_data['BEUDO Category'] != 'Residential'), 'Reported Residential Units'] = 0
# Assume all energy use columns are 0 if NA
energy_use_cols = ['Natural Gas Use (kBtu)', 'Fuel Oil #2 Use (kBtu)', 'Diesel #2 Use (kBtu)', 'Kerosene Use (kBtu)', 'District Chilled Water Use (kBtu)', 'District Steam Use (kBtu)',
'Electricity Use - Generated From Onsite Renewable Systems (kWh)']
merged_data[energy_use_cols] = merged_data[energy_use_cols].fillna(0)
As a point of curiosity, let's check whether there are residential units that are recorded as having basement area larger than 0.
# Check rows with a Unit value and an unfinished basement
merged_data[merged_data.Unit.notna() & (merged_data.UnfinishedBasementGross > 0)].sort_values('Reporting ID').head(10)
| Reporting ID | Data Year | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Year Built | Buildings Included Count | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Reported Residential Units | ENERGY STAR Score | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (kBtu) | Fuel Oil #2 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | BldgNum | Unit | LandArea | Exterior_NumStories | Exterior_WallType | Exterior_WallHeight | Exterior_RoofType | Exterior_RoofMaterial | Exterior_FloorLocation | Exterior_View | Interior_LivingArea | Interior_NumUnits | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_FullBaths | Interior_HalfBaths | Interior_Fireplaces | Interior_Flooring | Interior_Layout | Interior_LaundryInUnit | Systems_HeatType | Systems_HeatFuel | Systems_CentralAir | Systems_Plumbing | Condition_YearBuilt | Condition_InteriorCondition | Condition_OverallCondition | Condition_OverallGrade | Parking_Open | Parking_Covered | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38963 | B1156 | 2022-01-01 | Non-Residential | 40876.0 | 0.0 | 2004-01-01 | 2 | College/University | 50347.0 | 0.0 | NaN | 636223.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 636223.7 | 636223.7 | 12.6 | 12.6 | 1781426.3 | 1781426.3 | 35.4 | 35.4 | 45.1 | 0.9 | NaN | NaN | 1 | 3 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN | Average | 4096 | NaN | 6.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | NaN | No Impact | False | HEAT-PUMP | NaN | NaN | NaN | 1873 | NaN | Very Good | EXCELLENT | 0.0 | 0.0 | 0.0 | 1081 | 0.0 |
| 19814 | B1156 | 2019-01-01 | Non-Residential | 40876.0 | 0.0 | 2004-01-01 | 2 | College/University | 50347.0 | 0.0 | NaN | 1091705.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1091705.9 | 1091705.9 | 21.7 | 21.7 | 3056776.6 | 3056776.6 | 60.7 | 60.7 | 71.7 | 1.4 | NaN | NaN | 1 | 3 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN | Average | 4096 | NaN | 6.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | NaN | No Impact | False | HEAT-PUMP | NaN | NaN | NaN | 1873 | NaN | Very Good | EXCELLENT | 0.0 | 0.0 | 0.0 | 1081 | 0.0 |
| 30351 | B1156 | 2016-01-01 | Non-Residential | 40876.0 | 0.0 | 2004-01-01 | 2 | College/University | 50347.0 | 0.0 | NaN | 1240515.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1240515.0 | 1240515.0 | 24.6 | 24.6 | 3473441.9 | 3473441.9 | 69.0 | 69.0 | 93.0 | 1.8 | NaN | NaN | 1 | 3 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN | Average | 4096 | NaN | 6.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | NaN | No Impact | False | FORCED-AIR | NaN | NaN | NaN | 1873 | NaN | Average | AVERAGE | 0.0 | 0.0 | 0.0 | 1081 | 0.0 |
| 49981 | B1156 | 2021-01-01 | Non-Residential | 40876.0 | 0.0 | 2004-01-01 | 2 | College/University | 50347.0 | 0.0 | NaN | 769380.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 769380.2 | 769380.2 | 15.3 | 15.3 | 2154264.5 | 2154264.5 | 42.8 | 42.8 | 54.5 | 1.1 | NaN | NaN | 1 | 3 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN | Average | 4096 | NaN | 6.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | NaN | No Impact | False | HEAT-PUMP | NaN | NaN | NaN | 1873 | NaN | Very Good | EXCELLENT | 0.0 | 0.0 | 0.0 | 1081 | 0.0 |
| 4126 | B1156 | 2020-01-01 | Non-Residential | 40876.0 | 0.0 | 2004-01-01 | 2 | College/University | 50347.0 | 0.0 | NaN | 595793.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 595793.8 | 595793.8 | 11.8 | 11.8 | 1668222.7 | 1668222.7 | 33.1 | 33.1 | 42.2 | 0.8 | NaN | NaN | 1 | 3 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN | Average | 4096 | NaN | 6.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | NaN | No Impact | False | HEAT-PUMP | NaN | NaN | NaN | 1873 | NaN | Very Good | EXCELLENT | 0.0 | 0.0 | 0.0 | 1081 | 0.0 |
| 53717 | B1554 | 2022-01-01 | Residential | 45973.0 | 63.0 | 1950-01-01 | 1 | Multifamily Housing | 48704.0 | 63.0 | 65.0 | 606950.4 | 3724800.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4331750.4 | 4623507.2 | 88.9 | 94.9 | 5610501.2 | 5904335.5 | 115.2 | 121.2 | 240.8 | 4.9 | 1843.3 | 37.85 | 1 | 842 | 0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | Average | 925 | NaN | 0.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 | Hardwood | Thru Unit | False | HOT-WATR-RAD | NaN | NaN | NaN | 1920 | NaN | Average | AVERAGE | 0.0 | 0.0 | 0.0 | 929 | 0.0 |
| 47679 | B1554 | 2021-01-01 | Residential | 45973.0 | 63.0 | 1950-01-01 | 1 | Multifamily Housing | 48704.0 | 63.0 | 61.0 | 597219.4 | 3933500.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4530719.7 | 4925774.7 | 93.0 | 101.1 | 5802389.6 | 6190949.3 | 119.1 | 127.1 | 251.2 | 5.2 | 1639.9 | 33.67 | 1 | 842 | 0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | Average | 925 | NaN | 0.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 | Hardwood | Thru Unit | False | HOT-WATR-RAD | NaN | NaN | NaN | 1920 | NaN | Average | AVERAGE | 0.0 | 0.0 | 0.0 | 929 | 0.0 |
| 57617 | B1554 | 2020-01-01 | Residential | 45973.0 | 63.0 | 1950-01-01 | 1 | Multifamily Housing | 48704.0 | 63.0 | 49.0 | 616033.2 | 4289000.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4905033.3 | 5337169.3 | 100.7 | 109.6 | 6228343.2 | 6706551.9 | 127.9 | 137.7 | 271.5 | 5.6 | 1968.1 | 40.41 | 1 | 842 | 0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | Average | 925 | NaN | 0.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 | Hardwood | Thru Unit | False | HOT-WATR-RAD | NaN | NaN | NaN | 1920 | NaN | Average | AVERAGE | 0.0 | 0.0 | 0.0 | 929 | 0.0 |
| 47101 | B1632 | 2020-01-01 | Residential | 85123.0 | 57.0 | 1979-01-01 | 8 | Multifamily Housing | 57875.0 | 57.0 | 69.0 | 1270905.2 | 3261399.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4532304.9 | 4953871.5 | 78.3 | 85.6 | 6983004.2 | 7465126.8 | 120.7 | 129.0 | 263.3 | 4.5 | 184.0 | 3.18 | 1 | 36 | 0 | 3.0 | NaN | NaN | NaN | NaN | 1.0 | Average | 1715 | NaN | 7.0 | 3.0 | Galley | 2.0 | 1.0 | 0.0 | Carpet | Corner Unit | True | FORCED-AIR | NaN | NaN | NaN | 1981 | NaN | Good | AVERAGE | 0.0 | 0.0 | 2.0 | 646 | 0.0 |
| 47100 | B1632 | 2020-01-01 | Residential | 85123.0 | 57.0 | 1979-01-01 | 8 | Multifamily Housing | 57875.0 | 57.0 | 69.0 | 1270905.2 | 3261399.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4532304.9 | 4953871.5 | 78.3 | 85.6 | 6983004.2 | 7465126.8 | 120.7 | 129.0 | 263.3 | 4.5 | 184.0 | 3.18 | 1 | 35 | 0 | 3.0 | NaN | NaN | NaN | NaN | 1.0 | Average | 1458 | NaN | 5.0 | 2.0 | Pull Alcove | 2.0 | 1.0 | 0.0 | Hardwood | No Impact | True | FORCED-AIR | NaN | NaN | NaN | 1981 | NaN | Good | AVERAGE | 0.0 | 0.0 | 1.0 | 552 | 0.0 |
There are! That's good: It means we can use that column easily for both Unit and non-Unit rows.
One final check before we move on: Let's see what categories the plumbing column has.
# Check values of plumbing column
merged_data.Systems_Plumbing.value_counts(dropna=False)
Systems_Plumbing NaN 32199 AVERAGE 22871 GOOD 10671 EXCELLENT 2189 NONE 948 FAIR 124 POOR 54 Name: count, dtype: int64
A lot of missing values, in keeping with what we've seen with the rest so far, but the categories are clean and usable.
b. EDA Focus on Residential Data only¶
As noted previously, the Residential and Non-Residential datasets have some notable differences in which columns have values. Residential rows are also notable for their common use of residential Units, which require special handling and aggregation. Let's take a closer look at the residential data.
# Check info of residential data
resid_df = merged_data[merged_data['BEUDO Category']=='Residential']
resid_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 30230 entries, 61 to 69058 Data columns (total 65 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Reporting ID 30230 non-null object 1 Data Year 30230 non-null datetime64[ns] 2 BEUDO Category 30230 non-null object 3 PD Parcel Living Area 30230 non-null float64 4 PD Parcel Units 30184 non-null float64 5 Year Built 30230 non-null datetime64[ns] 6 Buildings Included Count 30230 non-null int64 7 Primary Property Type - Self Selected 30230 non-null object 8 Property GFA - Self Reported (ft2) 30175 non-null float64 9 Reported Residential Units 29264 non-null float64 10 ENERGY STAR Score 28951 non-null float64 11 Electricity Use - Grid Purchase (kBtu) 30069 non-null float64 12 Natural Gas Use (kBtu) 30230 non-null float64 13 Fuel Oil #2 Use (kBtu) 30230 non-null float64 14 Diesel #2 Use (kBtu) 30230 non-null float64 15 Kerosene Use (kBtu) 30230 non-null float64 16 District Chilled Water Use (kBtu) 30230 non-null float64 17 District Steam Use (kBtu) 30230 non-null float64 18 Electricity Use - Generated From Onsite Renewable Systems (kWh) 30230 non-null float64 19 Site Energy Use (kBtu) 30230 non-null float64 20 Weather Normalized Site Energy Use (kBtu) 29124 non-null float64 21 Site EUI (kBtu/ft2) 30230 non-null float64 22 Weather Normalized Site EUI (kBtu/ft2) 29124 non-null float64 23 Source Energy Use (kBtu) 30230 non-null float64 24 Weather Normalized Source Energy Use (kBtu) 29124 non-null float64 25 Source EUI (kBtu/ft2) 30230 non-null float64 26 Weather Normalized Source EUI (kBtu/ft2) 29124 non-null float64 27 Total GHG Emissions (Metric Tons CO2e) 30230 non-null float64 28 Total GHG Emissions Intensity (kgCO2e/ft2) 30230 non-null float64 29 Water Use (All Water Sources) (kgal) 28163 non-null float64 30 Water Intensity (All Water Sources) (gal/ft2) 26941 non-null float64 31 BldgNum 30230 non-null int64 32 Unit 27788 non-null object 33 LandArea 30230 non-null int64 34 Exterior_NumStories 30230 non-null float64 35 Exterior_WallType 2338 non-null object 36 Exterior_WallHeight 2328 non-null float64 37 Exterior_RoofType 106 non-null object 38 Exterior_RoofMaterial 2272 non-null object 39 Exterior_FloorLocation 27274 non-null float64 40 Exterior_View 27788 non-null object 41 Interior_LivingArea 30230 non-null int64 42 Interior_NumUnits 2442 non-null float64 43 Interior_TotalRooms 27902 non-null float64 44 Interior_Bedrooms 27902 non-null float64 45 Interior_Kitchens 25913 non-null object 46 Interior_FullBaths 27902 non-null float64 47 Interior_HalfBaths 27902 non-null float64 48 Interior_Fireplaces 27902 non-null float64 49 Interior_Flooring 25353 non-null object 50 Interior_Layout 26942 non-null object 51 Interior_LaundryInUnit 27788 non-null object 52 Systems_HeatType 29327 non-null object 53 Systems_HeatFuel 2067 non-null object 54 Systems_CentralAir 114 non-null object 55 Systems_Plumbing 2328 non-null object 56 Condition_YearBuilt 30230 non-null int64 57 Condition_InteriorCondition 106 non-null object 58 Condition_OverallCondition 30206 non-null object 59 Condition_OverallGrade 30222 non-null object 60 Parking_Open 27902 non-null float64 61 Parking_Covered 27902 non-null float64 62 Parking_Garage 27788 non-null float64 63 UnfinishedBasementGross 30230 non-null int64 64 FinishedBasementGross 27902 non-null float64 dtypes: datetime64[ns](2), float64(38), int64(6), object(19) memory usage: 15.2+ MB
# Plot weather normalized site energy use intensity by overall condition
plt.figure(figsize=(10, 6))
sns.boxplot(x='Condition_OverallCondition', y=np.log(resid_df['Weather Normalized Site EUI (kBtu/ft2)']), data=resid_df,
hue='Condition_OverallCondition', palette='viridis', legend=False)
plt.xlabel('Overall Condition')
plt.ylabel('Log of Weather-Normalized Site EUI')
plt.title('Weather-Normalized Site EUI by Overall Building Condition')
plt.xticks(rotation=45)
plt.show()
c:\Users\Z004KYBK\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
Site energy use by overall condition of the building. We might want to combine excellent, superior, and very good together.
Let's check the correlations between the predictors. We want to know where we need to adjust for multicollinearity before we do unnecessary exploration.
# Plot correlation heatmap for numerical features
df_num = resid_df.select_dtypes(include=['float64','int'])
correlation_matrix_large = df_num.corr()
# Set threshold 0.7 to be visible on map
threshold = 0.7
strong_corr = correlation_matrix_large[(correlation_matrix_large >= threshold) | (correlation_matrix_large <= -threshold)]
# Plot the heatmap for strong correlations only
plt.figure(figsize=(12, 10))
sns.heatmap(strong_corr, annot=True, cmap="coolwarm", fmt=".1f", mask=strong_corr.isnull())
plt.title("Heatmap of Strong Correlations (|r| > 0.5)")
plt.show()
PD parcel living area has strong correlation with PD parcel units, Property GFA selfreported (ftw), Reported Residential units. THere might be multicolieanrity between these features. We will consider drop if it's not improving models. Additional, it has strong correlation with Electricity use-Grid purchase, site energy use, source energy use, GHG emission.
PD parcel units, Property GFA, reported residential units have similar strong correlation with above energy usage. In addition, there is strong correlation with water usage.
Energy star score has strong negative correlation with the energy use intensity columns.
Electricity use grid purchase has similar correlation with energy use above.
Interior living area has strong correlation with total bedrooms and bathrooms.
We'll consider handling for these later, during preprocessing. For now, let's take another look at the log of our response variable and see how it compares to the numeric features of our residential dataset.
# Taking log of Weather Normalized Site EUI
resid_df = resid_df.copy() # Ensure no SettingWithCopyWarning
resid_df['Log_Weather Normalized Site EUI'] = np.log1p(resid_df['Weather Normalized Site EUI (kBtu/ft2)'])
# Plot numeric features vs log of weather normalized site EUI
numeric_columns = [col for col in resid_df.columns if resid_df[col].dtype in ['int64', 'float64']]
n_cols = 3
n_rows = math.ceil(len(numeric_columns) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4))
axes = axes.flatten()
for i, col in enumerate(numeric_columns):
sns.scatterplot(
ax=axes[i],
x=resid_df[col],
y=resid_df['Log_Weather Normalized Site EUI']
)
axes[i].set_title(f'{col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Log_Weather Normalized Site EUI')
correlation = resid_df[col].corr(resid_df['Log_Weather Normalized Site EUI'])
axes[i].annotate(f'correlation = {correlation:.2f}', xy=(0.05, 0.9), xycoords='axes fraction')
for j in range(i + 1, len(axes)):
axes[j].axis('off')
fig.suptitle('Scatter Plots of Numeric Features vs. Log-Transformed Weather Normalized Site EUI',
fontsize=14, y=1)
plt.tight_layout()
plt.show()
c:\Users\Z004KYBK\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\lib\function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] c:\Users\Z004KYBK\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\lib\function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
Strong Negative Correlations:¶
- ENERGY STAR Score (-0.50)
- Diesel #2 Use (kBtu) (-0.53) (it's possible that Diesel #2 is used in more efficient systems or as a supplemental fuel in buildings already designed for lower energy consumption. Need to see -> low data)
Moderate Positive Correlations:¶
- Natural Gas Use (kBtu) (0.32)
- District Steam Use (kBtu) (0.32)
- Site Energy Use (kBtu) (0.29)
- Electricity Use - Generated From Onsite Renewable Systems (kWh) (0.29) (this is probably due to the fact buildings having these technologies use large amount of energy overall. Need to see -> low data)
- Weather Normalized Site Energy Use (kBtu) (0.30)
- Total GHG Emissions (Metric Tons CO2e) (0.27)
Strong Positive Correlations:¶
- Site EUI (kBtu/ft2) (0.77)
- Source EUI (kBtu/ft2) (0.69)
- Total GHG Emissions Intensity (kgCO2e/ft2) (0.71)
- Fuel Oil #2 Use (kBtu) (0.90) (Need to see -> low data)
Weak or No Correlations:¶
Most other variables show very weak correlations close to 0, small sample sizes for some energy types due to missing data
Side notes:¶
We notice some odd distributions among the predictors.
- Condition_YearBuilt has a number of points with a year of 0, which is nonsensical.
- There are a few observations with disproportionately high parking space counts (presumably parking garages).
- The counts on some of the Interior columns suggest that there are several buildings that do not have unit-level property data but do contain many residential units.
The 0s in Condition_YearBuilt squashed the real values to the right side of the plot, so we'll do another with the 0s excluded.
# Compare Condition_YearBuilt to response
plt.figure(figsize=(12, 6))
sns.scatterplot(x=resid_df['Condition_YearBuilt'], y=resid_df['Log_Weather Normalized Site EUI'])
plt.xlim(1860, 2024)
plt.xticks(ticks=range(1860, 2025, 10), rotation=45)
plt.title('Condition_YearBuilt vs response')
plt.show()
Observation: When analyzing the relationship between the year built and EUI, an interesting trend emerges. From 1850 to 1950, EUI usage increases, while from 1950 to 2024, EUI shows a decline. A gradual change in building priorities? A change in building methods and technology? A change in what types of buildings are being built? Further exploration might tell us, but for now we have other priorities.
Let's check how the categorical features interact with our response variable.
# Drop some columns we want to exclude from plotting
resid_df.drop(columns=['Reporting ID','BEUDO Category','Unit'], axis=1,inplace=True)
object_columns = [col for col in resid_df.columns if resid_df[col].dtype in ['object']]
# Plot categorical features with Normalized site EUI
n_cols = 2
n_rows = math.ceil(len(object_columns) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 8, n_rows * 4))
axes = axes.flatten()
for i, col in enumerate(object_columns):
sns.violinplot(
ax=axes[i],
x=resid_df[col],
y=resid_df['Log_Weather Normalized Site EUI']
)
axes[i].set_title(f'{col}',fontsize=6)
axes[i].set_xlabel(col)
axes[i].tick_params(axis='x', labelsize=6, rotation=45)
axes[i].set_ylabel('Log_Weather Normalized Site EUI')
for j in range(i + 1, len(axes)):
axes[j].axis('off')
fig.suptitle('Violin Plots of Categorical Features vs. Log-Transformed Weather Normalized Site EUI',
fontsize=12, y=1)
plt.tight_layout()
plt.show()
Primary Property Type vs Log of Site EUI
Observation: Different property types exhibit distinct distributions of EUI. However, the majority of properties are classified as multifamily housing, with only a few classified as colleges. Multifamily housing spans a broad range of EUI usage.
Exterior Wall Type vs Log of Site EUI
Observation: Brick veneer, metal glass, and asphalt shingle wall types appear to have lower average EUI usage. The "other" category seems to have a large number of data points, which may need further exploration.
Exterior Roof Type vs Log of Site EUI
Observation: Gable and flat roofs appear to have higher EUI usage. However, there seems to be limited data available for these types.
Exterior Roof Material vs Log of Site EUI
Observation: Slate clay and wood shingle roof materials seem to have lower EUI on average. Rubber membrane roofs show higher EUI usage but also contain many outliers.
Exterior View vs Log of Site EUI
Observation: The condition of the exterior view does not appear to significantly impact EUI usage.
Interior Kitchen vs Log of Site EUI
Observation: The kitchen type does not seem to make a notable difference in EUI usage. This category contains both numerical and categorical data. Since it does not show much variation, it can be considered for removal in later analysis.
Interior Flooring vs Log of Site EUI
Observation: Softwood flooring appears to have the lowest EUI usage, but there are some significant outliers at the higher end. Marble flooring tends to have higher EUI on average.
Interior Layout vs Log of Site EUI
Observation: "Thru units" exhibit slightly higher EUI usage compared to other types, while "superior" layouts appear to have lower EUI usage. It is worth exploring what "superior" specifically means in this context. Based on the name, we can assume it refers to the best layout.
Interior Laundry in Units vs Log of Site EUI
Observation: Properties without in-unit laundry have slightly higher EUI usage, but the difference is minimal.
System Heat Type vs Log of Site EUI
Observation: There are several heat types in the raw data. Grouping similar types together might be beneficial. Further EDA should be conducted after data cleaning and imputation.
System Heat Fuel vs Log of Site EUI
Observation: The type of heating fuel does not seem to significantly impact EUI usage on average. However, electric heating shows slightly lower EUI on average.
System Central Air vs Log of Site EUI
Observation: Properties without central air systems tend to have EUI values concentrated around 4,000. Properties with central air systems appear to have lower EUI usage on average, though there is less data available for this category.
5. Data preprocessing and engineering (Property aggregation)¶
The rows for individual buildings and units in the Property Database dataset need to be aggregated up to property level, but they can't all be aggregated in the same way. Some are categorical, some need to be summed, some need to be averaged, and so on. In this section we define what needs to happen to each.
Explore needed aggregation and imputation¶
Let's go down the list.
- BldgNum: Building number within the property. Present even if there's only one building. Useful for verifying we don't have duplicates. Will be dropped during aggregation.
- Unit: Unit number within building. NaN if the building doesn't have recorded subunits. Maybe aggregate by this and then aggregate the buildings?
- LandArea: Appears to be land area of the whole parcel? Maybe? The value is shared by all buildings in a parcel, so presumably it's already the aggregate. LandArea is 0 for rows with Unit values. Likely to be problematic for prediction. Probably drop this.
- Exterior_NumStories: For buildings without Units, this is the number of stories the building has. For unit rows, this appears to be the story that unit is on. Take maximum for unit aggregation, take weighted average for building aggregation.
- Exterior_WallType: Take weighted mode. Could say "mixed" if the proportions are pretty even? Always empty for unit data. Could add "apartment/multifamily" category for those.
- Exterior_WallHeight: Average height of floors in the building. Always empty for unit data. Take weighted average for building data. Use median residential floor height for unit data. (12 feet?)
- Exterior_RoofType: While definitely interesting for energy use, almost no rows have values in this column. Too much missing data to be useful. We could impute to some extent from roof material, but it would be so 1:1 that we'd just get multicollinearity. We'll drop this.
- Exterior_RoofMaterial: Very interesting, and a lot of rows have it. Always empty for unit rows. Take weighted mode. If proportions are too similar, set as "MIXED".
- Exterior_FloorLocation: Floor level for condominium units. Always empty for non-unit rows. How this interacts with Exterior_NumStories is unclear. Maybe units with Exterior_NumStories higher than 1 span multiple stories, and Exterior_FloorLocation is what floor they start on? Or what floor the entrance is on? Treatment might not be uniform. Consider taking the maximum of the two columns for each row.
- Exterior_View: Unlikely to be relevant in any meaningful way. Theoretically could mean something about the window quality or placement, but that doesn't seem to be what it's going for. Drop this.
- Interior_LivingArea: Finished area of building. Separate for each unit and building. Useful for weighted aggregation. Aggregate by taking sum. As sanity check, want sum to be similar to PD Parcel Living Area.
- Interior_NumUnits: Number of units in commercial/apartment building. Mutually exclusive with Unit column. Used for properties where data wasn't separately recorded for each unit? Set to 1 for unit rows, then take sum to aggregate to building. For property-level aggregation, could have one column for mean numunits and one for total numunits? Doubt it's an important predictor.
- Interior_TotalRooms: "Total number of rooms in a commercial or residential building". Actual meaning is more ambiguous. There are residential rows with TotalRooms of 0.0. There are non-residential rows with 36 rooms. Unit and non-unit rows both have values for this column. Needs deeper investigation. To aggregate, take sum.
- Interior_Bedrooms: Total bedroom count. Take sum.
- Interior_Kitchens: Confused mix of kitchen types and counts of kitchens. Kitchen types are likely irrelevant. Convert kitchen types to 1, then aggregate via sum.
- Interior_FullBaths: Full bathroom count (i.e. bathrooms with all of toilet, sink, bathtub, shower). Take sum. Most non-unit rows have NA for this column. Probably impute from other columns.
- Interior_HalfBaths: Half bathroom count (i.e. bathrooms with only toilet and sink). Take sum. Can impute as with FullBaths.
- Interior_Fireplaces: Aggregate via sum.
- Interior_Flooring: Weighted mode. If similar proportions, use "Mixed".
- Interior_Layout: Only relevant to condo units. Could have impact on individual condo unit energy use, but we don't measure that. Drop.
- Interior_LaundryInUnit: Only relevant to condo units. Can sum to get total laundry machine count, but realistically it doesn't give us anything useful, not least because there can be a shared laundry that this variable doesn't catch. Plus, as long as there's laundry in the building, likelihood is that water/energy use will be the same as it would be for in-unit laundry. Drop.
- Systems_HeatType: Few missing values. Need to correct some entries (use the capitalized categories). Take weighted mode, as per usual. Try to impute missing values.
- Systems_HeatFuel: Not a lot of good values in this column. May be imputable from fuel use columns in energy use dataset. If we keep this, take weighted mode.
- Systems_CentralAir: Almost no values for this column. Not imputable, as much as I wish it were. Drop.
- Systems_Plumbing: Rating of plumbing system. Unclear what this means in practical terms. Seems to just be plumbing condition? Investigation shows that while plumbing condition has no strong relationship with energy use, plumbing of "None" tends to mean significantly lower energy use. Turn into a binary plumbing or no plumbing, then take weighted average when aggregating.
- Condition_YearBuilt: Year the building was built. Some rows have 0s for some reason. Matches Year Built for nearly all rows, but there are a few anomalies that need closer inspection. Handle those anomalies, then drop in favor of Data Year.
- Condition_InteriorCondition: Very few rows have this filled in. Not usefully imputable. Drop.
- Condition_OverallCondition: Has values for almost all rows. Impute or drop remainder. Need to figure out what "superior" rating means. Ordinal, so take weighted median. May convert to numeric?
- Condition_OverallGrade: Almost no missing values. Impute or drop remainder. Need to clean up categories, unify on capitalized category names. Ordinal, so take weighted median. May convert to numeric?
- Parking_Open: Number of open parking spaces for residential building or condo unit. I can't imagine this has any impact on energy use. Recommend drop.
- Parking_Covered: Number of covered parking spaces for residential building or condo unit. Probably low impact on energy use, but maybe some. Nearly all rows have NA or 0, a few have 1 or 2, five have 45. Extremely likely to cause overfitting. Recommend drop.
- Parking_Garage: Number of garage parking spaces for residential building or condominium unit. Wild variety in values. At least one of the buildings in the dataset is a 99-car parking garage. I don't know what to do with that. Leave it in, I guess? If we're looking at only residential buildings, we still get at least one building which has an internal garage that's marked as a unit (G1). Aggregate via sum.
- UnfinishedBasementGross: Not included in metadata spreadsheet, but presumably is total amount of basement without finishing (wood, flooring, etc.). Aggregate via sum.
- FinishedBasementGross: Not included in metadata spreadsheet, but presumably is total amount of basement with finishing (wood, flooring, etc.). Aggregate via sum.
Note: Weighting is, if possible, by Interior_LivingArea and Exterior_WallHeight. Larger buildings and units will have greater impact on the overall property energy use. Where Exterior_WallHeight is not available, it's acceptable to weight by Interior_LivingArea alone.
Now that we've made our plans for cleaning up the Property Database columns, we need to implement them.
Clean and impute missing values¶
We'll start by dropping a number of columns that we've determined are either irrelevant, redundant, or too sparse to use.
# Drop unwanted columns
property_dropcols = ['LandArea', 'Exterior_RoofType', 'Exterior_View', 'Interior_Layout', 'Interior_LaundryInUnit', 'Systems_CentralAir',
'Condition_InteriorCondition', 'Parking_Open', 'Parking_Covered']
merged_clean = merged_data.drop(property_dropcols, axis=1)
merged_clean.head()
| Reporting ID | Data Year | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Year Built | Buildings Included Count | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Reported Residential Units | ENERGY STAR Score | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (kBtu) | Fuel Oil #2 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | BldgNum | Unit | Exterior_NumStories | Exterior_WallType | Exterior_WallHeight | Exterior_RoofMaterial | Exterior_FloorLocation | Interior_LivingArea | Interior_NumUnits | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_FullBaths | Interior_HalfBaths | Interior_Fireplaces | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Systems_Plumbing | Condition_YearBuilt | Condition_OverallCondition | Condition_OverallGrade | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1600 | 2016-01-01 | Non-Residential | 57000.0 | 1.0 | 1910-01-01 | 1 | Office | 38000.0 | 0.0 | NaN | 32880.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32880.5 | 32880.5 | 0.9 | 0.9 | 92065.5 | 92065.5 | 2.4 | 2.4 | 0.0 | 0.0 | NaN | NaN | 1 | NaN | 5.0 | BRICK | 12.0 | TAR-GRAVEL | NaN | 33904 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STEAM | Other | AVERAGE | 1910 | Average | GOOD | NaN | 0 | NaN |
| 1 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 1 | NaN | 3.0 | BRICK | 12.0 | SLATE-CLAY | NaN | 42105 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | AVERAGE | 1840 | Good | VERY GOOD | NaN | 0 | NaN |
| 2 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 2 | NaN | 2.0 | WOOD-SHN-SHK | 12.0 | SLATE-CLAY | NaN | 6960 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | AVERAGE | 1870 | Very Good | EXCELLENT | NaN | 2784 | NaN |
| 3 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 3 | NaN | 2.0 | OTHER | 12.0 | ROLL-ROOFING | NaN | 50874 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | GOOD | 1965 | Very Good | VERY GOOD | NaN | 0 | NaN |
| 4 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 4 | NaN | 4.5 | BRICK | 12.0 | SLATE-CLAY | NaN | 66082 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | Gas | GOOD | 1840 | Good | VERY GOOD | NaN | 13912 | NaN |
Next, we'll consolidate kitchen types so it can be used as a numeric feature, then convert plumbing to a simple "has/doesn't have plumbing" binary.
# Convert kitchen types to 1
merged_clean['Interior_Kitchens'] = [1 if x in ['Pull Alcove', 'Galley', 'Full Eat In'] else x for x in merged_clean.Interior_Kitchens]
merged_clean['Interior_Kitchens'] = [0 if x in ["None", None, "0"] else x for x in merged_clean.Interior_Kitchens]
merged_clean['Interior_Kitchens'] = pd.to_numeric(merged_clean['Interior_Kitchens'], errors='coerce')
# Convert plumbing to binary
merged_clean.loc[(merged_clean['Systems_Plumbing'] != "NONE") & merged_clean.Systems_Plumbing.notna(), 'Systems_Plumbing'] = 1
merged_clean.loc[merged_clean['Systems_Plumbing'] == "NONE", 'Systems_Plumbing'] = 0
Rows with a Unit value don't necessarily have a NumUnits value. Since we'll be aggregating them to building level and then using them alongside rows that do have NumUnits counts, we'll fill all rows with a Unit value to have a NumUnits of 1. We'll sum during aggregation to get the number of units for the building.
# Set Interior_NumUnits to 1 for unit rows.
merged_clean.loc[merged_clean.Unit.notna() & merged_clean.Interior_NumUnits.isna(), 'Interior_NumUnits'] = 1
Next item. There are 56 rows with 0 in Condition_YearBuilt. The lots containing these observations have similar Condition_YearBuilt values, so we impute using the mode of Condition_YearBuilt for the same lot. Reporting IDs with no Condition_YearBuilt values are imputed with the overall mode.
# Add column marking that values were/will be imputed
merged_clean['Imputed_Condition_YearBuilt'] = [1 if x == 0 else 0 for x in merged_clean['Condition_YearBuilt']]
# Impute Condition_YearBuilt zeros with mode of same reporting ID
for rid in merged_clean['Reporting ID'].unique():
group = merged_clean.loc[merged_clean['Reporting ID'] == rid]
year_built_mode = group['Condition_YearBuilt'].mode()[0]
merged_clean.loc[(merged_clean['Reporting ID'] == rid) & (merged_clean['Condition_YearBuilt'] == 0), 'Condition_YearBuilt'] = year_built_mode
# Impute remaining zeros with the overall mode
overall_mode = int(merged_clean['Condition_YearBuilt'].mode()[0])
merged_clean.loc[merged_clean['Condition_YearBuilt'] == 0, 'Condition_YearBuilt'] = overall_mode
If possible, we'd like to impute Exterior_WallType. Unfortunately, we are seeing that the majority of Residential rows don't have values for this column. For commercial or industrial buildings, exterior wall types might be documented due to structural and regulatory reasons (e.g., compliance with fire safety or energy efficiency standards). For residential buildings, especially small-scale or multi-family homes, this data might not be routinely recorded. This could explain why Exterior_WallType is often missing for "Residential."
We did the chi-square test to verify if there is association between wall type and property type so we can use that to compute the missing. If p value is less than 0.05, it means there is strong association between exterior wall type and property type.
# Check count of NA values in roof material by BEUDO category
na_counts2 = merged_clean.groupby('BEUDO Category')['Exterior_RoofMaterial'].apply(lambda x: x.isna().sum())
na_counts2
BEUDO Category Municipal 1 Non-Residential 4241 Residential 27958 Name: Exterior_RoofMaterial, dtype: int64
# Plot the distribution of Exterior_WallType across Property Type
top_property_types = merged_clean['Primary Property Type - Self Selected'].value_counts().head(5).index
filtered_data = merged_clean[merged_clean['Primary Property Type - Self Selected'].isin(top_property_types)]
# Create the count plot
sns.countplot(
data=filtered_data,
x='Primary Property Type - Self Selected',
hue='Exterior_WallType'
)
plt.xticks(rotation=45)
plt.title("Distribution of Exterior Wall Type by Top 5 Property Types")
plt.show()
# Perform the chi-square test, if p value is less than 0, it means there is strong association between exterior wall type and property type
non_missing_data = merged_clean[merged_clean['Exterior_WallType'].notna()]
contingency_table = pd.crosstab(non_missing_data['Primary Property Type - Self Selected'], non_missing_data['Exterior_WallType'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
# Print test statistics
print(f"Chi-Square Test Statistic: {chi2:.3f}")
print(f"P-Value: {p:.3f}")
Chi-Square Test Statistic: 30020.532 P-Value: 0.000
There is clearly a strong relationship between property type and exterior wall type. We may be able to use this to get better imputation of wall type. We'll define a function to calculate the mode for each property type, defaulting to "OTHER" if no better option is available. In keeping with best practices, we will also add a column indicating whether Exterior_WallType was imputed.
# Define a function to handle mode calculation with a fallback
def get_mode_or_default(series, default="OTHER"):
return series.mode()[0] if not series.mode().empty else default
# Calculate the most common wall type (mode) for each property type
wall_type_mode_by_property = merged_clean.groupby('Primary Property Type - Self Selected')['Exterior_WallType'].agg(lambda x: get_mode_or_default(x))
# Add column marking that values were/will be imputed
merged_clean['Imputed_Exterior_WallType'] = pd.isna(merged_clean['Exterior_WallType']).astype(int)
# Impute missing values with the mode of the corresponding property type
merged_clean['Exterior_WallType'] = merged_clean.apply(
lambda row: wall_type_mode_by_property[row['Primary Property Type - Self Selected']]
if pd.isna(row['Exterior_WallType']) else row['Exterior_WallType'],
axis=1
)
# Check updated value counts
merged_clean['Exterior_WallType'].value_counts()
Exterior_WallType BRICK 52907 OTHER 5117 WOOD-SHN-SHK 2799 STUCCO 2568 FRAME-CLAPBD 1686 CONCR-BLOCK 1173 CONCR PANEL 819 STONE-VENEER 748 BRICK-VENEER 721 METAL-GLASS 388 METAL 113 TILT PANEL 12 ASPHALT-SHNG 3 ASBSTOS-SHNG 2 Name: count, dtype: int64
Those last three categories have too few observations. They'll just overfit. We'll join them to OTHER.
# Join last three categories to OTHER
merged_clean['Exterior_WallType'] = ['OTHER' if x in ['ASPHALT-SHNG','ASBSTOS-SHNG','TILT PANEL'] else x for x in merged_clean['Exterior_WallType']]
Given the noted relationships, we'll impute missing roof materials by both primary property type and exterior wall type.We can apply a similar approach to Exterior_RoofMaterial, but it looks like Exterior_RoofMaterial has association with both property type and exterior wall type.
# Make values to fill in missing roof materials
roof_mode_by_combination = merged_clean.groupby(['Primary Property Type - Self Selected', 'Exterior_WallType'])['Exterior_RoofMaterial'].agg(lambda x: x.mode()[0] if not x.mode().empty else "OTHER")
# Add column marking that values were/will be imputed
merged_clean['Imputed_Exterior_RoofMaterial'] = pd.isna(merged_clean['Exterior_RoofMaterial']).astype(int)
# Impute missing values
merged_clean['Exterior_RoofMaterial'] = merged_clean.apply(
lambda row: roof_mode_by_combination.get((row['Primary Property Type - Self Selected'], row['Exterior_WallType']), "OTHER")
if pd.isna(row['Exterior_RoofMaterial']) else row['Exterior_RoofMaterial'],
axis=1
)
# Check roof material values after imputation
merged_clean['Exterior_RoofMaterial'].value_counts()
Exterior_RoofMaterial TAR-GRAVEL 41955 SLATE-CLAY 11910 RUBBER-MEMBRAN 3997 METAL 3635 ROLL-ROOFING 3047 ASPHALT-SHNG 2768 OTHER 1175 WOOD-SHINGLE 218 TILE 215 COMPOSITION 136 Name: count, dtype: int64
There are a few low-prevalence categories remaining, but none so low that they should suffer terrible overfitting. On to the next item.
The Condition columns are currently ordinal text columns. We want to make them numeric so the model can treat them as ordinal. By our previous exploration, the "Superior" category seems to have similar energy use intensities to "Good", so for simplicity and clarity we'll merge those groups together.
merged_clean.Condition_OverallCondition.value_counts(dropna=False)
Condition_OverallCondition Excellent 18873 Good 18555 Very Good 16364 Average 13158 Fair 1578 Poor 213 Superior 174 NaN 141 Name: count, dtype: int64
# "Superior" seems redundant and unclear, so we'll merge it with "Good"
merged_clean['Condition_OverallCondition'] = ['Good' if x == 'Superior' else x for x in merged_clean.Condition_OverallCondition]
# Impute missing with most common category for same reporting ID
merged_clean.reset_index(drop=True, inplace=True)
merged_clean['Imputed_OverallCondition'] = merged_clean.Condition_OverallCondition.isna().astype(int)
for i in range(merged_clean.shape[0]):
if pd.isna(merged_clean.loc[i, 'Condition_OverallCondition']):
merged_clean.loc[i, 'Condition_OverallCondition'] = get_mode_or_default(merged_clean.loc[merged_clean['Reporting ID'] == merged_clean.loc[i, 'Reporting ID'], 'Condition_OverallCondition'], 'Good')
# Convert Condition_OverallCondition to numeric
condition_order = ['Poor', 'Fair', 'Average', 'Good', 'Very Good', 'Excellent']
merged_clean['Condition_OverallCondition'] = [condition_order.index(x) for x in merged_clean.Condition_OverallCondition]
merged_clean.Condition_OverallGrade.value_counts(dropna=False)
Condition_OverallGrade GOOD 24379 VERY GOOD 14509 EXCELLENT 9731 AVERAGE 8705 GOOD VERY GOOD 8661 VERY GOOD EXCELLENT 1628 FAIR 728 SPECIAL 630 POOR 68 NaN 17 Name: count, dtype: int64
Grade has a "special" category that doesn't fit into the main ordering. Let's see how it compares to the other categories.
# Plot weather normalized site energy use intensity by overall condition
plt.figure(figsize=(8, 6))
sns.boxplot(x='Condition_OverallGrade', y=np.log(merged_clean['Weather Normalized Site EUI (kBtu/ft2)']), data=merged_clean,
hue='Condition_OverallGrade', palette='viridis', legend=False)
plt.xlabel('Overall Grade')
plt.ylabel('Log of Weather-Normalized Site EUI')
plt.title('Weather-Normalized Site EUI by Overall Building Grade')
plt.xticks(rotation=45)
plt.show()
c:\Users\Z004KYBK\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
It seems like SPECIAL is pretty much equivalent to VERY GOOD EXCELLENT. For clarity, we'll merge the former into the latter.
# Apply the same procedure to Condition_OverallGrade
# Replace SPECIAL with VERY GOOD EXCELLENT
merged_clean['Condition_OverallGrade'] = ['VERY GOOD EXCELLENT' if x == 'SPECIAL' else x for x in merged_clean.Condition_OverallGrade]
# Impute missing with most common category for same reporting ID
merged_clean.reset_index(drop=True, inplace=True)
merged_clean['Imputed_OverallGrade'] = merged_clean.Condition_OverallGrade.isna().astype(int)
for i in range(merged_clean.shape[0]):
if pd.isna(merged_clean.loc[i, 'Condition_OverallGrade']):
merged_clean.loc[i, 'Condition_OverallGrade'] = get_mode_or_default(merged_clean.loc[merged_clean['Reporting ID'] == merged_clean.loc[i, 'Reporting ID'], 'Condition_OverallGrade'], 'GOOD')
# Convert Condition_OverallGrade to numeric
grade_order = ['POOR', 'FAIR', 'AVERAGE', 'GOOD', 'GOOD VERY GOOD', 'VERY GOOD', 'VERY GOOD EXCELLENT', 'EXCELLENT']
merged_clean['Condition_OverallGrade'] = [grade_order.index(x) for x in merged_clean.Condition_OverallGrade]
# Check that the modified columns have come out as intended
merged_clean.head()
| Reporting ID | Data Year | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Year Built | Buildings Included Count | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Reported Residential Units | ENERGY STAR Score | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (kBtu) | Fuel Oil #2 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | BldgNum | Unit | Exterior_NumStories | Exterior_WallType | Exterior_WallHeight | Exterior_RoofMaterial | Exterior_FloorLocation | Interior_LivingArea | Interior_NumUnits | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_FullBaths | Interior_HalfBaths | Interior_Fireplaces | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Systems_Plumbing | Condition_YearBuilt | Condition_OverallCondition | Condition_OverallGrade | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Imputed_Condition_YearBuilt | Imputed_Exterior_WallType | Imputed_Exterior_RoofMaterial | Imputed_OverallCondition | Imputed_OverallGrade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1600 | 2016-01-01 | Non-Residential | 57000.0 | 1.0 | 1910-01-01 | 1 | Office | 38000.0 | 0.0 | NaN | 32880.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32880.5 | 32880.5 | 0.9 | 0.9 | 92065.5 | 92065.5 | 2.4 | 2.4 | 0.0 | 0.0 | NaN | NaN | 1 | NaN | 5.0 | BRICK | 12.0 | TAR-GRAVEL | NaN | 33904 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | STEAM | Other | 1 | 1910 | 2 | 3 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 |
| 1 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 1 | NaN | 3.0 | BRICK | 12.0 | SLATE-CLAY | NaN | 42105 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | 1 | 1840 | 3 | 5 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 |
| 2 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 2 | NaN | 2.0 | WOOD-SHN-SHK | 12.0 | SLATE-CLAY | NaN | 6960 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | NaN | 1 | 1870 | 4 | 7 | NaN | 2784 | NaN | 0 | 0 | 0 | 0 | 0 |
| 3 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 3 | NaN | 2.0 | OTHER | 12.0 | ROLL-ROOFING | NaN | 50874 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PACKAGE-A/C | NaN | 1 | 1965 | 4 | 5 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 |
| 4 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 4 | NaN | 4.5 | BRICK | 12.0 | SLATE-CLAY | NaN | 66082 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HOT-WATER | Gas | 1 | 1840 | 3 | 5 | NaN | 13912 | NaN | 0 | 0 | 0 | 0 | 0 |
With those items made usable, we move on to the next task.
# Check remaining NAs
na_counts = merged_clean.isna().sum()
na_counts[na_counts > 0]
PD Parcel Units 2715 Property GFA - Self Reported (ft2) 58 Reported Residential Units 966 ENERGY STAR Score 26809 Electricity Use - Grid Purchase (kBtu) 1248 Weather Normalized Site Energy Use (kBtu) 6696 Weather Normalized Site EUI (kBtu/ft2) 6393 Weather Normalized Source Energy Use (kBtu) 6696 Weather Normalized Source EUI (kBtu/ft2) 6393 Water Use (All Water Sources) (kgal) 6700 Water Intensity (All Water Sources) (gal/ft2) 8375 Unit 37114 Exterior_WallHeight 32146 Exterior_FloorLocation 37958 Interior_TotalRooms 36910 Interior_Bedrooms 36910 Interior_Kitchens 39728 Interior_FullBaths 36910 Interior_HalfBaths 36910 Interior_Fireplaces 36910 Interior_Flooring 39978 Systems_HeatType 1716 Systems_HeatFuel 52203 Systems_Plumbing 32199 Parking_Garage 37114 FinishedBasementGross 36910 dtype: int64
Remaining aggregation and imputation:
Exterior_NumStories: For buildings without Units, this is the number of stories the building has. For unit rows, this appears to be the story that unit is on. Take maximum for unit aggregation, take weighted average for building aggregation.
Interior_Flooring: Weighted mode. If similar proportions, use "Mixed".
Systems_HeatType: Few missing values. Take weighted mode, as per usual. Try to impute missing values.
Systems_HeatFuel: Not a lot of good values in this column. May be imputable from fuel use columns in energy use dataset. If we keep this, take weighted mode.
Exterior_WallHeight: Average height of floors in the building. Always empty for unit data. Take weighted average for building data. Use median residential floor height for unit data. (12 feet?)
Systems_Plumbing: After correction, take weighted average.
Interior_LivingArea: Finished area of building. Separate for each unit and building. Useful for weighted aggregation. Aggregate by taking sum. As sanity check, want sum to be similar to PD Parcel Living Area.
Interior_TotalRooms: "Total number of rooms in a commercial or residential building". Actual meaning is more ambiguous. There are residential rows with TotalRooms of 0.0. There are non-residential rows with 36 rooms. Unit and non-unit rows both have values for this column. Needs deeper investigation. To aggregate, take sum.
Parking_Garage: Number of garage parking spaces for residential building or condominium unit. Wild variety in values. At least one of the buildings in the dataset is a 99-car parking garage. I don't know what to do with that. Leave it in, I guess? If we're looking at only residential buildings, we still get at least one building which has an internal garage that's marked as a unit (G1). Aggregate via sum.
Interior_NumUnits: take sum to aggregate to building. For property-level aggregation, could have one column for mean numunits and one for total numunits? Doubt it's an important predictor.
Columns to drop:
Drop, as not important! Exterior_FloorLocation: Floor level for condominium units. Always empty for non-unit rows. How this interacts with Exterior_NumStories is unclear. Maybe units with Exterior_NumStories higher than 1 span multiple stories, and Exterior_FloorLocation is what floor they start on? Or what floor the entrance is on? Treatment might not be uniform. Consider taking the maximum of the two columns for each row.
Drop, as not important! Interior_FullBaths: Full bathroom count (i.e. bathrooms with all of toilet, sink, bathtub, shower). Take sum. Most non-unit rows have NA for this column. Probably impute from other columns.
Drop, as not important! Interior_HalfBaths: Half bathroom count (i.e. bathrooms with only toilet and sink). Take sum. Can impute as with FullBaths.
Drop, as not important! Interior_Kitchens: Confused mix of kitchen types and counts of kitchens. Kitchen types are likely irrelevant. then aggregate via sum.
We also note that we still have a number of rows with Site EUI but no Weather Normalized Site EUI. We should be able to impute the latter by finding the year-specific normalization coefficient among the rows we do have normalized data for.
# Impute weather-normalized site EUI for rows that only have non-normalized
# Framework suggested by ChatGPT
# Step 1: Calculate normalization coefficients
# Filter rows with non-missing values for both Site EUI and Weather-Normalized Site EUI
merged_clean_valid = merged_clean.dropna(subset=['Site EUI (kBtu/ft2)', 'Weather Normalized Site EUI (kBtu/ft2)'])
# Calculate normalization coefficient for each year
coefficients = merged_clean_valid.groupby('Data Year').apply(
lambda group: (group['Weather Normalized Site EUI (kBtu/ft2)'] / group['Site EUI (kBtu/ft2)']).mean(),
include_groups=False
).to_dict()
# Step 2: Impute missing values
def impute_weather_normalized(row):
if pd.isna(row['Weather Normalized Site EUI (kBtu/ft2)']):
return row['Site EUI (kBtu/ft2)'] * coefficients.get(row['Data Year'], 1) # Default coefficient of 1 if year is missing
return row['Weather Normalized Site EUI (kBtu/ft2)']
merged_clean['Weather Normalized Site EUI (kBtu/ft2)'] = merged_clean.apply(impute_weather_normalized, axis=1)
Aggregate¶
Now that imputation is sufficiently complete, we will move on to aggregation. The types of aggregation needed vary. Some need weighted medians, some need sums, some need weighted modes, and so on. In order to account for the differing sizes of the buildings or units in a map lot, we weight them by their internal living area.
# Take sums of columns that need that
sumcols = ['Interior_LivingArea', 'Interior_TotalRooms', 'Interior_Bedrooms', 'Interior_Kitchens', 'Interior_Fireplaces',
'Parking_Garage', 'UnfinishedBasementGross', 'FinishedBasementGross', 'Interior_NumUnits']
aggdf = merged_clean.groupby(['Reporting ID', 'Data Year', 'BldgNum'])[sumcols].agg('sum').reset_index()
aggdf.Interior_Kitchens = aggdf.Interior_Kitchens.astype('int32')
aggdf.head(10)
| Reporting ID | Data Year | BldgNum | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1000 | 2016-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 1 | B1000 | 2017-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 2 | B1000 | 2018-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 3 | B1000 | 2019-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 4 | B1000 | 2020-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 5 | B1000 | 2021-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 6 | B1000 | 2022-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 |
| 7 | B1001 | 2016-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 |
| 8 | B1001 | 2017-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 |
| 9 | B1001 | 2018-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 |
# Define function to calculate weighted medians
# From https://stackoverflow.com/a/73905572, fixed by ChatGPT
def weighted_median(values, weights):
i = np.argsort(values) # Sort values and get sorted indices
sorted_values = values.iloc[i] # Reorder values using the sorted indices
c = np.cumsum(weights.iloc[i]) # Cumulative sum of sorted weights
median_index = np.searchsorted(c, 0.5 * c.iloc[-1]) # Find the median index
return sorted_values.iloc[median_index] # Return the value at the median index
# Take weighted medians of columns that need that
weighted_mediancols = ['Condition_OverallCondition', 'Condition_OverallGrade']
groupby_cols = ['Reporting ID', 'Data Year', 'BldgNum']
for col in weighted_mediancols:
result = merged_clean.groupby(groupby_cols).apply(
lambda x: weighted_median(x[col], x['Interior_LivingArea']),
include_groups=False
).reset_index(name='Weighted_Median')
aggdf[col] = result['Weighted_Median']
# Function to compute weighted mode
# From ChatGPT: "Get the weighted mode of a categorical column in pandas."
def weighted_mode(category, weight):
# Combine categories and weights
counts = Counter()
for cat, w in zip(category, weight):
counts[cat] += w
# Find the mode(s) with the highest weight
max_weight = max(counts.values())
mode = [key for key, val in counts.items() if val == max_weight]
return mode[0] # Return the first mode in case of ties
# Take weighted mode of columns that need that
weighted_modecols = ['Exterior_WallType', 'Exterior_RoofMaterial', 'Interior_Flooring', 'Systems_HeatType', 'Systems_HeatFuel',
'Imputed_Exterior_RoofMaterial', 'Imputed_Condition_YearBuilt', 'Imputed_Exterior_WallType', 'Imputed_OverallCondition',
'Imputed_OverallGrade']
groupby_cols = ['Reporting ID', 'Data Year', 'BldgNum']
for col in weighted_modecols:
result = merged_clean.groupby(groupby_cols).apply(
lambda x: weighted_mode(x[col], x['Interior_LivingArea']),
include_groups=False
).reset_index(name='Weighted_Mode')
aggdf[col] = result['Weighted_Mode']
aggdf.head()
| Reporting ID | Data Year | BldgNum | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | Condition_OverallCondition | Condition_OverallGrade | Exterior_WallType | Exterior_RoofMaterial | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Imputed_Exterior_RoofMaterial | Imputed_Condition_YearBuilt | Imputed_Exterior_WallType | Imputed_OverallCondition | Imputed_OverallGrade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1000 | 2016-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 |
| 1 | B1000 | 2017-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 |
| 2 | B1000 | 2018-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 |
| 3 | B1000 | 2019-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 |
| 4 | B1000 | 2020-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 |
# Take weighted means of columns that need that
weighted_meancols = ['Systems_Plumbing', 'Exterior_WallHeight']
for col in weighted_meancols:
aggdf[col] = merged_clean.groupby(groupby_cols).apply(
lambda x: (x[col] * x['Interior_LivingArea']) / max(0.0000000001, sum(x['Interior_LivingArea']) * len(x)),
include_groups=False,
).reset_index(name=col)[col]
aggdf.head(10)
| Reporting ID | Data Year | BldgNum | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | Condition_OverallCondition | Condition_OverallGrade | Exterior_WallType | Exterior_RoofMaterial | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Imputed_Exterior_RoofMaterial | Imputed_Condition_YearBuilt | Imputed_Exterior_WallType | Imputed_OverallCondition | Imputed_OverallGrade | Systems_Plumbing | Exterior_WallHeight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1000 | 2016-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 1 | B1000 | 2017-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 2 | B1000 | 2018-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 3 | B1000 | 2019-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 4 | B1000 | 2020-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 5 | B1000 | 2021-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 6 | B1000 | 2022-01-01 | 1 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | 4 | 4 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 12.0 |
| 7 | B1001 | 2016-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | 5 | 3 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 14.0 |
| 8 | B1001 | 2017-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | 5 | 3 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 14.0 |
| 9 | B1001 | 2018-01-01 | 1 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | 5 | 3 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 1.0 | 14.0 |
# Aggregate to property level
aggdf2 = aggdf.groupby(['Reporting ID', 'Data Year'])[sumcols].agg('sum').reset_index()
for col in weighted_modecols:
result = aggdf.groupby(['Reporting ID', 'Data Year']).apply(
lambda x: weighted_mode(x[col], x['Interior_LivingArea']),
include_groups=False,
).reset_index(name='Weighted_Mode')
aggdf2[col] = result['Weighted_Mode']
for col in weighted_mediancols:
result = aggdf.groupby(['Reporting ID', 'Data Year']).apply(
lambda x: weighted_median(x[col], x['Interior_LivingArea']),
include_groups=False
).reset_index(name='Weighted_Median')
aggdf2[col] = result['Weighted_Median']
aggdf2['Systems_Plumbing'] = aggdf.groupby(['Reporting ID', 'Data Year']).apply(
lambda x: (x['Systems_Plumbing'] * x['Interior_LivingArea']) / max(0.0000000001, sum(x['Interior_LivingArea']) * len(x)),
include_groups=False,
).reset_index(name='Systems_Plumbing')['Systems_Plumbing']
aggdf2.head(10)
| Reporting ID | Data Year | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | Exterior_WallType | Exterior_RoofMaterial | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Imputed_Exterior_RoofMaterial | Imputed_Condition_YearBuilt | Imputed_Exterior_WallType | Imputed_OverallCondition | Imputed_OverallGrade | Condition_OverallCondition | Condition_OverallGrade | Systems_Plumbing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1000 | 2016-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 1 | B1000 | 2017-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 2 | B1000 | 2018-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 3 | B1000 | 2019-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 4 | B1000 | 2020-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 5 | B1000 | 2021-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 6 | B1000 | 2022-01-01 | 144489 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 43563 | 0.0 | 0.0 | BRICK | RUBBER-MEMBRAN | NaN | FORCED-AIR | Gas | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 1.0 |
| 7 | B1001 | 2016-01-01 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 5 | 3 | 1.0 |
| 8 | B1001 | 2017-01-01 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 5 | 3 | 1.0 |
| 9 | B1001 | 2018-01-01 | 57545 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 51.0 | METAL-GLASS | RUBBER-MEMBRAN | NaN | HOT-WATER | Gas | 0 | 0 | 0 | 0 | 0 | 5 | 3 | 1.0 |
# Check shape of energy use portion of merged_clean
merged_clean.loc[:,:'Water Intensity (All Water Sources) (gal/ft2)'].drop_duplicates(ignore_index=True).shape
(4726, 31)
# Merge with full energy dataset
merged_agg = pd.merge(merged_clean.loc[:,:'Water Intensity (All Water Sources) (gal/ft2)'].drop_duplicates(ignore_index=True), aggdf2, on=['Reporting ID', 'Data Year'])
print(merged_agg.shape)
merged_agg.head()
(4726, 53)
| Reporting ID | Data Year | BEUDO Category | PD Parcel Living Area | PD Parcel Units | Year Built | Buildings Included Count | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Reported Residential Units | ENERGY STAR Score | Electricity Use - Grid Purchase (kBtu) | Natural Gas Use (kBtu) | Fuel Oil #2 Use (kBtu) | Diesel #2 Use (kBtu) | Kerosene Use (kBtu) | District Chilled Water Use (kBtu) | District Steam Use (kBtu) | Electricity Use - Generated From Onsite Renewable Systems (kWh) | Site Energy Use (kBtu) | Weather Normalized Site Energy Use (kBtu) | Site EUI (kBtu/ft2) | Weather Normalized Site EUI (kBtu/ft2) | Source Energy Use (kBtu) | Weather Normalized Source Energy Use (kBtu) | Source EUI (kBtu/ft2) | Weather Normalized Source EUI (kBtu/ft2) | Total GHG Emissions (Metric Tons CO2e) | Total GHG Emissions Intensity (kgCO2e/ft2) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft2) | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | Exterior_WallType | Exterior_RoofMaterial | Interior_Flooring | Systems_HeatType | Systems_HeatFuel | Imputed_Exterior_RoofMaterial | Imputed_Condition_YearBuilt | Imputed_Exterior_WallType | Imputed_OverallCondition | Imputed_OverallGrade | Condition_OverallCondition | Condition_OverallGrade | Systems_Plumbing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1600 | 2016-01-01 | Non-Residential | 57000.0 | 1.0 | 1910-01-01 | 1 | Office | 38000.0 | 0.0 | NaN | 32880.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32880.5 | 32880.5 | 0.9 | 0.9 | 92065.5 | 92065.5 | 2.4 | 2.4 | 0.0 | 0.0 | NaN | NaN | 33904 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 1.0 | BRICK | TAR-GRAVEL | NaN | STEAM | Other | 0 | 0 | 0 | 0 | 0 | 2 | 3 | NaN |
| 1 | B1360 | 2017-01-01 | Non-Residential | 1462119.0 | 18.0 | 2010-01-01 | 1 | Parking | 223000.0 | 0.0 | NaN | 1876014.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1876014.7 | 1876014.7 | 8.4 | 8.4 | 5252841.3 | 5252841.3 | 23.6 | 23.6 | 140.6 | 0.6 | NaN | NaN | 1425620 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 70998 | 0.0 | 17.0 | BRICK | ROLL-ROOFING | NaN | PACKAGE-A/C | NaN | 0 | 0 | 0 | 0 | 0 | 4 | 5 | NaN |
| 2 | B1898 | 2017-01-01 | Municipal | 12571.0 | 0.0 | 1993-01-01 | 1 | Restaurant | 3912.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | BRICK | OTHER | NaN | OTHER | Other | 0 | 0 | 0 | 0 | 0 | 5 | 7 | 0.000606 |
| 3 | B1326 | 2019-01-01 | Non-Residential | 3691096.0 | 0.0 | 1928-01-01 | 1 | College/University | 55046.0 | 0.0 | NaN | 1907403.6 | 0.0 | 0.0 | 0.0 | 0.0 | 1154460.1 | 3708563.7 | 0.0 | 6770427.4 | 6713394.5 | 123.0 | 122.0 | 10852843.2 | 10840848.1 | 197.2 | 196.9 | 432.3 | 7.9 | 286.5 | 5.2 | 3591730 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 98291 | 0.0 | 1.0 | OTHER | TAR-GRAVEL | NaN | PACKAGE-A/C | NaN | 0 | 0 | 0 | 0 | 0 | 2 | 5 | NaN |
| 4 | B1441 | 2019-01-01 | Residential | 67125.0 | 66.0 | 1922-01-01 | 1 | Multifamily Housing | 94360.0 | 66.0 | 46.0 | 603674.9 | 6004800.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6608474.9 | 6971311.4 | 70.0 | 73.9 | 7995329.7 | 8361688.0 | 84.7 | 88.6 | 358.6 | 3.8 | NaN | NaN | 67125 | 305.0 | 87.0 | 65 | 40.0 | 0.0 | 0 | 0.0 | 66.0 | BRICK | TAR-GRAVEL | Hardwood | STEAM | NaN | 1 | 0 | 1 | 0 | 0 | 3 | 3 | 0.000275 |
EDA after data preprocessing¶
# Subset the data down to residential properties for simplicity
resid_df = merged_agg[merged_agg['BEUDO Category']=='Residential']
resid_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 1068 entries, 4 to 4725 Data columns (total 53 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Reporting ID 1068 non-null object 1 Data Year 1068 non-null datetime64[ns] 2 BEUDO Category 1068 non-null object 3 PD Parcel Living Area 1068 non-null float64 4 PD Parcel Units 1034 non-null float64 5 Year Built 1068 non-null datetime64[ns] 6 Buildings Included Count 1068 non-null int64 7 Primary Property Type - Self Selected 1068 non-null object 8 Property GFA - Self Reported (ft2) 1067 non-null float64 9 Reported Residential Units 840 non-null float64 10 ENERGY STAR Score 824 non-null float64 11 Electricity Use - Grid Purchase (kBtu) 1059 non-null float64 12 Natural Gas Use (kBtu) 1068 non-null float64 13 Fuel Oil #2 Use (kBtu) 1068 non-null float64 14 Diesel #2 Use (kBtu) 1068 non-null float64 15 Kerosene Use (kBtu) 1068 non-null float64 16 District Chilled Water Use (kBtu) 1068 non-null float64 17 District Steam Use (kBtu) 1068 non-null float64 18 Electricity Use - Generated From Onsite Renewable Systems (kWh) 1068 non-null float64 19 Site Energy Use (kBtu) 1068 non-null float64 20 Weather Normalized Site Energy Use (kBtu) 964 non-null float64 21 Site EUI (kBtu/ft2) 1068 non-null float64 22 Weather Normalized Site EUI (kBtu/ft2) 1068 non-null float64 23 Source Energy Use (kBtu) 1068 non-null float64 24 Weather Normalized Source Energy Use (kBtu) 964 non-null float64 25 Source EUI (kBtu/ft2) 1068 non-null float64 26 Weather Normalized Source EUI (kBtu/ft2) 964 non-null float64 27 Total GHG Emissions (Metric Tons CO2e) 1068 non-null float64 28 Total GHG Emissions Intensity (kgCO2e/ft2) 1068 non-null float64 29 Water Use (All Water Sources) (kgal) 993 non-null float64 30 Water Intensity (All Water Sources) (gal/ft2) 958 non-null float64 31 Interior_LivingArea 1068 non-null int64 32 Interior_TotalRooms 1068 non-null float64 33 Interior_Bedrooms 1068 non-null float64 34 Interior_Kitchens 1068 non-null int32 35 Interior_Fireplaces 1068 non-null float64 36 Parking_Garage 1068 non-null float64 37 UnfinishedBasementGross 1068 non-null int64 38 FinishedBasementGross 1068 non-null float64 39 Interior_NumUnits 1068 non-null float64 40 Exterior_WallType 1068 non-null object 41 Exterior_RoofMaterial 1068 non-null object 42 Interior_Flooring 246 non-null object 43 Systems_HeatType 1060 non-null object 44 Systems_HeatFuel 700 non-null object 45 Imputed_Exterior_RoofMaterial 1068 non-null int64 46 Imputed_Condition_YearBuilt 1068 non-null int64 47 Imputed_Exterior_WallType 1068 non-null int64 48 Imputed_OverallCondition 1068 non-null int64 49 Imputed_OverallGrade 1068 non-null int64 50 Condition_OverallCondition 1068 non-null int64 51 Condition_OverallGrade 1068 non-null int64 52 Systems_Plumbing 574 non-null object dtypes: datetime64[ns](2), float64(31), int32(1), int64(10), object(9) memory usage: 446.4+ KB
# Taking log of Weather Normalized Site EUI
resid_df = resid_df.copy()
resid_df['Log_Weather_Normalized_Site_EUI'] = np.log1p(resid_df['Weather Normalized Site EUI (kBtu/ft2)'])
# Subset to relevant features
modeling_cols = ['PD Parcel Living Area', 'PD Parcel Units', 'Year Built', 'Primary Property Type - Self Selected', 'Property GFA - Self Reported (ft2)',
'Interior_LivingArea', 'Interior_TotalRooms', 'Interior_Bedrooms', 'Interior_Kitchens', 'Interior_Fireplaces', 'Parking_Garage', 'UnfinishedBasementGross', 'FinishedBasementGross',
'Interior_NumUnits', 'Exterior_WallType', 'Imputed_Exterior_WallType', 'Exterior_RoofMaterial', 'Imputed_Exterior_RoofMaterial',
'Condition_OverallCondition', 'Imputed_OverallCondition', 'Condition_OverallGrade', 'Imputed_OverallGrade']#, 'Systems_HeatType']
response_col = 'Log_Weather_Normalized_Site_EUI'
x = resid_df[modeling_cols]
y = resid_df[response_col]
Real evaluation will have the train set split from the test set earlier in the notebook, but due to time constraints we're focused on just getting everything to run through properly. We'll adjust the placement of the split and evaluation at a later date, and we'll do our proper model evaluations once that's been done. For now, we need to get the data in a workable state for any modeling.
# Check head of training data
x.head(5)
| PD Parcel Living Area | PD Parcel Units | Year Built | Primary Property Type - Self Selected | Property GFA - Self Reported (ft2) | Interior_LivingArea | Interior_TotalRooms | Interior_Bedrooms | Interior_Kitchens | Interior_Fireplaces | Parking_Garage | UnfinishedBasementGross | FinishedBasementGross | Interior_NumUnits | Exterior_WallType | Imputed_Exterior_WallType | Exterior_RoofMaterial | Imputed_Exterior_RoofMaterial | Condition_OverallCondition | Imputed_OverallCondition | Condition_OverallGrade | Imputed_OverallGrade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 67125.0 | 66.0 | 1922-01-01 | Multifamily Housing | 94360.0 | 67125 | 305.0 | 87.0 | 65 | 40.0 | 0.0 | 0 | 0.0 | 66.0 | BRICK | 1 | TAR-GRAVEL | 1 | 3 | 0 | 3 | 0 |
| 21 | 69317.0 | 148.0 | 2007-01-01 | Residence Hall/Dormitory | 4600.0 | 138634 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 26490 | 0.0 | 282.0 | BRICK-VENEER | 0 | TAR-GRAVEL | 0 | 2 | 0 | 3 | 0 |
| 26 | 61416.0 | 56.0 | 1981-01-01 | Multifamily Housing | 82696.0 | 61416 | 248.0 | 113.0 | 55 | 19.0 | 26.0 | 0 | 0.0 | 56.0 | BRICK | 1 | TAR-GRAVEL | 1 | 3 | 0 | 2 | 0 |
| 27 | 282112.0 | 186.0 | 1898-01-01 | Multifamily Housing | 213000.0 | 282112 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 186.0 | CONCR-BLOCK | 0 | TAR-GRAVEL | 0 | 4 | 0 | 4 | 0 |
| 28 | 55406.0 | 102.0 | 1979-01-01 | Multifamily Housing | 79017.0 | 55406 | 263.0 | 110.0 | 102 | 0.0 | 0.0 | 0 | 0.0 | 102.0 | BRICK | 1 | TAR-GRAVEL | 1 | 2 | 0 | 2 | 0 |
# Check remaining missing values in training data
x.info()
<class 'pandas.core.frame.DataFrame'> Index: 1068 entries, 4 to 4725 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PD Parcel Living Area 1068 non-null float64 1 PD Parcel Units 1034 non-null float64 2 Year Built 1068 non-null datetime64[ns] 3 Primary Property Type - Self Selected 1068 non-null object 4 Property GFA - Self Reported (ft2) 1067 non-null float64 5 Interior_LivingArea 1068 non-null int64 6 Interior_TotalRooms 1068 non-null float64 7 Interior_Bedrooms 1068 non-null float64 8 Interior_Kitchens 1068 non-null int32 9 Interior_Fireplaces 1068 non-null float64 10 Parking_Garage 1068 non-null float64 11 UnfinishedBasementGross 1068 non-null int64 12 FinishedBasementGross 1068 non-null float64 13 Interior_NumUnits 1068 non-null float64 14 Exterior_WallType 1068 non-null object 15 Imputed_Exterior_WallType 1068 non-null int64 16 Exterior_RoofMaterial 1068 non-null object 17 Imputed_Exterior_RoofMaterial 1068 non-null int64 18 Condition_OverallCondition 1068 non-null int64 19 Imputed_OverallCondition 1068 non-null int64 20 Condition_OverallGrade 1068 non-null int64 21 Imputed_OverallGrade 1068 non-null int64 dtypes: datetime64[ns](1), float64(9), int32(1), int64(8), object(3) memory usage: 187.7+ KB
# For sake of modeling, drop rows that still have NAs
x = x.dropna()
y = y[x.index]
One-hot encode categorical¶
Categorical columns need to be one-hot encoded. The OneHotEncoder function will let us apply the encoding separately to train and test.
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, random_state=42)
# One-hot encode categorical columns
categorical_columns = ['Primary Property Type - Self Selected', 'Exterior_WallType', 'Exterior_RoofMaterial']#, 'Systems_HeatType']
enc = OneHotEncoder(drop='first',handle_unknown='ignore')
oh_cols = enc.fit_transform(X_train[categorical_columns])
oh_cols = pd.DataFrame(oh_cols.toarray(), columns = enc.get_feature_names_out(), index=X_train.index)
X_train = X_train.drop(categorical_columns, axis=1).join(oh_cols)
X_train.head()
oh_cols_test = enc.transform(X_test[categorical_columns])
oh_cols_test = pd.DataFrame(oh_cols_test.toarray(), columns=enc.get_feature_names_out(), index=X_test.index)
X_test = X_test.drop(categorical_columns, axis=1).join(oh_cols_test)
Scale data¶
Our data uses a wide range of scales. Most models, including our baseline linear model, need all columns to use similar scales. We'll use StandardScaler to make it easy to apply the same scaling to train and test.
# Convert year built to integer
X_train['Year Built'] = X_train['Year Built'].values.astype(int)
X_test['Year Built'] = X_test['Year Built'].values.astype(int)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_scaled, columns = scaler.get_feature_names_out(), index=X_train.index)
X_train.head()
X_test_scale= scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scale, columns=scaler.get_feature_names_out(), index=X_test.index)
Fit simple regression model¶
A simple linear regression model makes a great baseline. Simple, interpretable, and with useful assumptions to test.
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_train)
Interpret results¶
# Evaluate the model
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)
# Display results
print(f"Model Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Model Coefficients: [ 7.86910848e-02 1.30541448e-02 3.61236881e-08 1.33060314e-01 -3.29978765e-01 5.33104513e-01 -2.03807072e-01 -8.11799053e-02 -4.81181126e-02 -1.79132108e-01 8.08906837e-02 8.98294674e-03 2.23442970e-02 9.95786879e-02 -2.03352620e-01 -1.60810022e-01 -6.51947217e-01 1.01651331e-01 2.54917162e-01 -1.62915728e-01 -6.86347692e-02 1.37791455e-01 -2.16696645e-01 -1.53056420e-01 -5.13257411e-02 -9.32335107e-02 -1.90340940e-01 4.01793074e-02 3.46240165e-03 1.06647354e-02 4.09829884e-02 5.86813482e-02 1.01303417e-01 3.76942029e-02 -4.67498495e-03 5.77766027e-03 1.28023266e-01 2.57709966e-02] Intercept: 3.9291181825384003 Mean Squared Error: 0.6439176410873254 R-squared: 0.35150102451811294
# Get model diagnostic plots
# From ChatGPT: "Get regression analysis plots in Python. Residuals, Q-Q plot, etc, in the style of R regression diagnostics."
# From ChatGPT: "Can we get a Cook's Distance plot in Python like we can in R?" "Can I get the above results with a Scikit-learn model?"
# Get residuals
residuals = y_train - y_pred
# Calculate leverage (hat values)
X_with_const = np.hstack([np.ones((X_train.shape[0], 1)), X_train]) # Add constant for intercept
H = X_with_const @ np.linalg.inv(X_with_const.T @ X_with_const) @ X_with_const.T # Hat matrix
leverage = np.diag(H)
# Calculate Cook's Distance
n = len(y_train)
p = X_with_const.shape[1]
cooks_d = (residuals**2 / (p * np.var(residuals))) * (leverage / (1 - leverage)**2)
# Get diagnostic plots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
axes = axes.flatten()
# 1. Residuals vs. Fitted Values
axes[0].scatter(y_pred, residuals, alpha=0.7, edgecolor='k')
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_title('Residuals vs Fitted')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
# 2. Q-Q Plot
qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Q-Q Plot')
# 3. Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = residuals / np.std(residuals)
axes[2].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.7, edgecolor='k')
axes[2].axhline(0, color='red', linestyle='--')
axes[2].set_title('Scale-Location Plot')
axes[2].set_xlabel('Fitted Values')
axes[2].set_ylabel('Sqrt(|Standardized Residuals|)')
# Plot Cook's Distance
axes[3].stem(np.arange(len(cooks_d)), cooks_d, basefmt=" ")
axes[3].axhline(4 / n, color='red', linestyle='--', label="Threshold (4/n)")
axes[3].set_title("Cook's Distance Plot")
axes[3].set_xlabel("Observation Index")
axes[3].set_ylabel("Cook's Distance")
axes[3].legend()
plt.show()
C:\Users\Z004KYBK\AppData\Local\Temp\ipykernel_512\1305674668.py:16: RuntimeWarning: divide by zero encountered in divide cooks_d = (residuals**2 / (p * np.var(residuals))) * (leverage / (1 - leverage)**2)
Residuals Pattern
- The residuals vs. fitted plot shows no particular evidence of heteroscedasticity. The fitted values at the ends are tightly grouped relative to the values around 4.0, but there aren't many of them, so that's not necessarily a problem.
- The Q-Q plot indicates that the residuals are not normally distributed.
Influential Observations
- There are a couple of points with relatively high Cook's Distance values. They aren't necessarily a problem, especially since this model has other major problems, but they're something to look into.
# Investigate point with high Cook's Distance
cooks_d[cooks_d > 0.08]
2306 0.273308 859 inf Name: Log_Weather_Normalized_Site_EUI, dtype: float64
7. Feature Importance and Interpretation¶
# Get model coefficients and sort by value
coefficients = pd.Series(model.coef_, index=X_train.columns)
coefficients = coefficients.abs().sort_values(ascending=False)
# Print feature importance
print("Feature Importance (based on absolute coefficients):")
print(coefficients.head(10))
Feature Importance (based on absolute coefficients): Imputed_OverallCondition 0.651947 Interior_TotalRooms 0.533105 Interior_LivingArea 0.329979 Imputed_OverallGrade 0.254917 Primary Property Type - Self Selected_Parking 0.216697 Interior_Bedrooms 0.203807 Imputed_Exterior_RoofMaterial 0.203353 Exterior_WallType_CONCR PANEL 0.190341 Parking_Garage 0.179132 Primary Property Type - Self Selected_Multifamily Housing 0.162916 dtype: float64
explainer = shap.Explainer(model.predict, X_train)
shap_values_lr = explainer(X_train)
shap.summary_plot(shap_values_lr, X_train, max_display=10,show=False)
plt.title("Polynomial Regression - SHAP Feature Importance")
plt.show()
PermutationExplainer explainer: 827it [00:15, 24.78it/s]
Model Diagnostics¶
- Residuals vs. Fitted Plot:
- This plot assesses the linearity assumption and checks for non-linearity in the data.
- The residuals appear scattered around the zero line with no clear pattern, which suggests that the linearity assumption is reasonably met.
- Q-Q Plot:
- The Q-Q plot evaluates the normality of residuals.
- Most points lie along the theoretical quantile line, indicating approximate normality of residuals, though there may be slight deviations in the tails.
- Scale-Location Plot:
- This plot examines the homoscedasticity assumption (constant variance of residuals).
- The residuals are mostly evenly spread, but some heteroscedasticity may be present at higher fitted values.
- Cook's Distance Plot:
- Cook's distance helps identify influential data points.
- A few points have higher Cook’s distance, but they are below the threshold, suggesting no extreme influence on the model.
- Performance Metrics:
- Mean Squared Error (MSE): 0.6439 indicates moderate error in predictions.
- R-squared: 0.3515 shows that the model explains 35.15% of the variance in the data, which suggests room for improvement.
Feature Commentary¶
- Interior_TotalRooms:
- Impact: High SHAP values (positive) for more total rooms suggest that larger buildings with more rooms contribute to higher EUI.
- Insight: More rooms likely indicate greater energy usage due to heating, cooling, and lighting requirements.
- Interior_LivingArea:
- Impact: Higher living area contributes to higher EUI (positive SHAP values).
- Insight: Larger areas require more energy for climate control and maintenance, making this a significant driver of energy usage.
- Imputed_OverallCondition:
- Impact: Poorer overall conditions are associated with higher EUI (positive SHAP values).
- Insight: Older or less maintained buildings might have inefficient insulation, appliances, or systems leading to increased energy usage.
- Imputed_Exterior_RoofMaterial:
- Impact: Certain roof materials are linked to higher EUI.
- Insight: This may be due to insulation properties or how well the roof material supports energy efficiency.
- Condition_OverallCondition:
- Impact: Higher values for this feature (likely representing better conditions) decrease EUI (negative SHAP values).
- Insight: Well-maintained buildings with efficient systems consume less energy.
- Interior_Bedrooms:
- Impact: More bedrooms correlate with slightly higher EUI.
- Insight: Similar to total rooms, more bedrooms likely indicate a larger building footprint and increased energy demand.
- Property GFA (Gross Floor Area):
- Impact: Higher GFA corresponds to increased EUI (positive SHAP values).
- Insight: Larger properties consume more energy overall due to greater heating, cooling, and operational demands.
- Primary Property Type - Multifamily Housing:
- Impact: Multifamily housing tends to have slightly higher EUI.
- Insight: Shared systems or higher density of usage (e.g., heating, cooling) may explain this trend.
- Parking_Garage:
- Impact: Parking garages are associated with slightly lower EUI (negative SHAP values).
- Insight: These structures may not require as much energy-intensive climate control as other areas.
- Imputed_Exterior_WallType:
- Impact: Some wall types correlate with higher EUI (positive SHAP values).
- Insight: Wall types with poor insulation properties or energy inefficiency might explain this relationship.
X_test.info()
<class 'pandas.core.frame.DataFrame'> Index: 207 entries, 1357 to 548 Data columns (total 38 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PD Parcel Living Area 207 non-null float64 1 PD Parcel Units 207 non-null float64 2 Year Built 207 non-null float64 3 Property GFA - Self Reported (ft2) 207 non-null float64 4 Interior_LivingArea 207 non-null float64 5 Interior_TotalRooms 207 non-null float64 6 Interior_Bedrooms 207 non-null float64 7 Interior_Kitchens 207 non-null float64 8 Interior_Fireplaces 207 non-null float64 9 Parking_Garage 207 non-null float64 10 UnfinishedBasementGross 207 non-null float64 11 FinishedBasementGross 207 non-null float64 12 Interior_NumUnits 207 non-null float64 13 Imputed_Exterior_WallType 207 non-null float64 14 Imputed_Exterior_RoofMaterial 207 non-null float64 15 Condition_OverallCondition 207 non-null float64 16 Imputed_OverallCondition 207 non-null float64 17 Condition_OverallGrade 207 non-null float64 18 Imputed_OverallGrade 207 non-null float64 19 Primary Property Type - Self Selected_Multifamily Housing 207 non-null float64 20 Primary Property Type - Self Selected_Other 207 non-null float64 21 Primary Property Type - Self Selected_Outpatient Rehabilitation/Physical Therapy 207 non-null float64 22 Primary Property Type - Self Selected_Parking 207 non-null float64 23 Primary Property Type - Self Selected_Residence Hall/Dormitory 207 non-null float64 24 Primary Property Type - Self Selected_Senior Living Community 207 non-null float64 25 Exterior_WallType_BRICK-VENEER 207 non-null float64 26 Exterior_WallType_CONCR PANEL 207 non-null float64 27 Exterior_WallType_CONCR-BLOCK 207 non-null float64 28 Exterior_WallType_FRAME-CLAPBD 207 non-null float64 29 Exterior_WallType_METAL-GLASS 207 non-null float64 30 Exterior_WallType_OTHER 207 non-null float64 31 Exterior_WallType_STUCCO 207 non-null float64 32 Exterior_WallType_WOOD-SHN-SHK 207 non-null float64 33 Exterior_RoofMaterial_OTHER 207 non-null float64 34 Exterior_RoofMaterial_ROLL-ROOFING 207 non-null float64 35 Exterior_RoofMaterial_RUBBER-MEMBRAN 207 non-null float64 36 Exterior_RoofMaterial_SLATE-CLAY 207 non-null float64 37 Exterior_RoofMaterial_TAR-GRAVEL 207 non-null float64 dtypes: float64(38) memory usage: 71.2 KB
degree = 2
# Step 1: Transform features
poly = PolynomialFeatures(degree=degree)
# Fit and transform the training data
X_train_poly = poly.fit_transform(X_train)
# Transform the test data
X_test_poly = poly.transform(X_test)
# Step 2: Fit the linear regression model
model = LinearRegression()
model.fit(X_train_poly, y_train)
# Step 3: Predict on the test data
y_pred = model.predict(X_test_poly)
# Step 4: Evaluate the model
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
MSE: 3.0513931928656956e+22 R^2 Score: -4.183988698343098e+22
pipeline = Pipeline([
('poly', PolynomialFeatures()),
('linear', LinearRegression())
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
MSE: 0.6380529476686705 R^2 Score: 0.12511887087121865
param_grid = {'poly__degree': [1, 2, 3]} # Range of degrees to try
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print("Best Degree:", grid.best_params_['poly__degree'])
print("Best Model MSE:", -grid.best_score_)
Best Degree: 1 Best Model MSE: 0.7325268938073665
best_poly_model = grid.best_estimator_
result = permutation_importance(
best_poly_model, X_train, y_train, n_repeats=10, random_state=42, n_jobs=-1
)
forest_importances = pd.DataFrame({'Feature': X_train.columns, 'Importance': result.importances_mean})
forest_importances = forest_importances.sort_values(by='Importance', ascending=False)
print("\nPolynomial Regression Feature Importance (Top 10 using Permutation Importance):")
print(forest_importances.head(10))
Polynomial Regression Feature Importance (Top 10 using Permutation Importance):
Feature Importance
16 Imputed_OverallCondition 0.864381
5 Interior_TotalRooms 0.567355
4 Interior_LivingArea 0.214203
18 Imputed_OverallGrade 0.127380
22 Primary Property Type - Self Selected_Parking 0.093451
14 Imputed_Exterior_RoofMaterial 0.084548
6 Interior_Bedrooms 0.082515
26 Exterior_WallType_CONCR PANEL 0.071621
9 Parking_Garage 0.064767
19 Primary Property Type - Self Selected_Multifam... 0.054370
y_test_pred = grid.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)
# Display results
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 0.6380529476686706 R-squared: 0.12511887087121842
explainer_poly = shap.Explainer(grid.best_estimator_.predict, X_train)
shap_values_poly = explainer_poly(X_train)
shap.summary_plot(shap_values_poly, X_train, max_display=10,show=False)
plt.title("Polynomial Regression - SHAP Feature Importance")
plt.show()
PermutationExplainer explainer: 827it [00:13, 16.68it/s]
Model Diagnostics¶
- Best Model Selection:
- After performing hyperparameter tuning with polynomial regression degrees 1 through 3, the model with degree = 1 was selected as the best performer, yielding the lowest Mean Squared Error (MSE) of 0.736981.
- Residuals and Fit Assessment:
- Based on the visualizations (e.g., residual plots), the linear (degree 1) polynomial model appears to fit the data reasonably well without introducing overfitting. The residuals do not show strong patterns, indicating that the linear model appropriately captures the underlying relationships.
- Cross-Validation Performance:
- The cross-validation process (5-fold CV) ensures robustness in performance evaluation. The selected model generalizes well across different data splits, further supporting its reliability.
- Model Limitations:
- Despite the selected model's simplicity, the MSE value suggests there is still room for improvement. Including non-linear features, interaction terms, or other transformations may help improve the model's predictive power.
Feature Impact Analysis¶
Feature importance was evaluated using Permutation Importance. Below is a summary of the top 10 features and their contributions:
- Property GFA - Self Reported (ft²):
- Impact: The most important feature, significantly influencing EUI.
- Insight: Larger gross floor areas consume more energy due to higher heating, cooling, and operational needs.
- Imputed_OverallCondition:
- Impact: Strongly affects predictions. Poorer overall condition (low values) corresponds to higher EUI.
- Insight: Buildings in worse condition likely have inefficient systems or outdated structures contributing to increased energy usage.
- Interior_LivingArea:
- Impact: Higher living area correlates with higher EUI.
- Insight: Larger interior spaces require more energy to heat, cool, and light.
- Year Built:
- Impact: Older buildings (lower Year Built values) tend to have higher EUI.
- Insight: This is likely due to outdated building standards and less efficient materials or systems in older constructions.
- Primary Property Type - Self Selected_Parking:
- Impact: Parking properties show some influence on EUI.
- Insight: Such structures might have unique energy demands, such as lighting or ventilation systems.
- Parking_Garage:
- Impact: Garages slightly increase EUI.
- Insight: Ventilation and lighting contribute to energy consumption.
- PD Parcel Living Area:
- Impact: Living area on a parcel affects energy usage.
- Insight: Similar to GFA, larger areas are directly associated with higher EUI.
- UnfinishedBasementGross:
- Impact: Presence and size of unfinished basements affect EUI.
- Insight: Energy usage here may arise from temperature regulation or other maintenance requirements.
- Condition_OverallGrade:
- Impact: Higher grades indicate better conditions and are associated with lower EUI.
- Insight: Well-maintained or recently updated buildings consume less energy due to better efficiency.
- Interior_NumUnits:
- Impact: More units correspond to slightly increased EUI.
- Insight: Multi-unit buildings may have higher energy demands due to shared utilities and increased occupancy.
8. Advanced Model¶
The baseline linear regression model was a decent starting point, but we can do better. We'll use a tree-based model to get around the need for normality. Our biggest concern with this data is bias due to questionably useful data, so gradient boosting will be a great way to minimize that issue.
# Drop observation 859 for outlier reasons
X_train.drop(859, axis=0, inplace=True)
y_train.drop(859, axis=0, inplace=True)
# Use cross-validation to find best gradient boosting parameters
boost_model = GradientBoostingRegressor(random_state=66)
param_grid = {'n_estimators': [100,150,200],
'learning_rate': [0.05,0.1,0.2],
'max_depth': [1,2,3,4,5]}
grid = GridSearchCV(boost_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print("Best Tree Count:", grid.best_params_['n_estimators'])
print("Best Learning Rate:", grid.best_params_['learning_rate'])
print("Best Tree Depth:", grid.best_params_['max_depth'])
print("Best Model MSE:", -grid.best_score_)
Best Tree Count: 200 Best Learning Rate: 0.05 Best Tree Depth: 5 Best Model MSE: 0.31053237304082576
# Initialize and train final gradient boosting regressor
model = GradientBoostingRegressor(
n_estimators=grid.best_params_['n_estimators'],
learning_rate=grid.best_params_['learning_rate'],
max_depth=grid.best_params_['max_depth'],
random_state=66
)
model.fit(X_train, y_train)
# Predict on train data
y_pred = model.predict(X_train)
# Evaluate the model
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)
# Display results
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 0.14253184809986164 R-squared: 0.8563856181811315
# Get feature importance
feature_importance = model.feature_importances_
feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)
print("\nFeature Importance (Top 10):")
print(feature_importance_df.head(10))
Feature Importance (Top 10):
Feature Importance
3 Property GFA - Self Reported (ft2) 0.274667
16 Imputed_OverallCondition 0.189571
2 Year Built 0.054604
4 Interior_LivingArea 0.053848
0 PD Parcel Living Area 0.050679
22 Primary Property Type - Self Selected_Parking 0.048985
10 UnfinishedBasementGross 0.048089
9 Parking_Garage 0.030055
15 Condition_OverallCondition 0.028430
11 FinishedBasementGross 0.027909
explainer_gb = shap.Explainer(model, X_train)
shap_values_gb = explainer_gb(X_train)
shap.summary_plot(shap_values_gb, X_train, max_display=10, show=False)
plt.title("Gradient Boosting Regressor Regression - SHAP Feature Importance")
plt.show()
# Get model diagnostic plots
# From ChatGPT: "Get regression analysis plots in Python. Residuals, Q-Q plot, etc, in the style of R regression diagnostics."
# From ChatGPT: "Can we get a Cook's Distance plot in Python like we can in R?" "Can I get the above results with a Scikit-learn model?"
# Get residuals
residuals = y_train - y_pred
# Calculate leverage (hat values)
X_with_const = np.hstack([np.ones((X_train.shape[0], 1)), X_train]) # Add constant for intercept
H = X_with_const @ np.linalg.inv(X_with_const.T @ X_with_const) @ X_with_const.T # Hat matrix
leverage = np.diag(H)
# Calculate Cook's Distance
n = len(y_train)
p = X_with_const.shape[1]
cooks_d = (residuals**2 / (p * np.var(residuals))) * (leverage / (1 - leverage)**2)
# Get diagnostic plots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
axes = axes.flatten()
# 1. Residuals vs. Fitted Values
axes[0].scatter(y_pred, residuals, alpha=0.7, edgecolor='k')
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_title('Residuals vs Fitted')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
# 2. Q-Q Plot
qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Q-Q Plot')
# 3. Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = residuals / np.std(residuals)
axes[2].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.7, edgecolor='k')
axes[2].axhline(0, color='red', linestyle='--')
axes[2].set_title('Scale-Location Plot')
axes[2].set_xlabel('Fitted Values')
axes[2].set_ylabel('Sqrt(|Standardized Residuals|)')
# Plot Cook's Distance
axes[3].stem(np.arange(len(cooks_d)), cooks_d, basefmt=" ")
axes[3].axhline(4 / n, color='red', linestyle='--', label="Threshold (4/n)")
axes[3].set_title("Cook's Distance Plot")
axes[3].set_xlabel("Observation Index")
axes[3].set_ylabel("Cook's Distance")
axes[3].legend()
plt.show()
# Investigate point with high Cook's Distance
cooks_d[cooks_d > 0.08]
754 0.236292 2296 0.236292 Name: Log_Weather_Normalized_Site_EUI, dtype: float64
# Evaluate final model on test set
y_test_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)
# Display results
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 0.18977851540816093 R-squared: 0.7397807776747487
Our final model achieves an R-squared of 0.76 on the test set. Considering how uncooperative the data was, this is a good result. We can be reasonably confident in the feature importances, at least those at the top of the list.
Model Diagnostics:¶
Residuals vs. Fitted: The plot shows a fairly random scatter of residuals around the zero line, which is a good sign. It suggests that the assumption of homoscedasticity (constant variance of errors) is reasonably met. There's a slight hint of a funnel shape, which might indicate a mild violation, but it's not severe.
Q-Q Plot: This plot assesses normality of residuals. The residuals follow the diagonal line quite closely, especially in the middle section. The deviations at the tails suggest some slight non-normality, but again, it's not drastically off. For a Gradient Boosting model, perfect normality isn't strictly required.
Scale-Location Plot: This plot provides another view of homoscedasticity. The relatively flat line indicates a fairly constant variance of residuals across fitted values, supporting the observations from the Residuals vs. Fitted plot.
Cook's Distance: This plot identifies influential data points that disproportionately affect the model's fit. Several points exceed the threshold (4/n), indicating the presence of influential observations. You should investigate these points further. Are they data errors? Or do they represent truly unusual cases? Removing or correcting these points might improve the model's stability.
¶
Feature Impact Analysis¶
- Property GFA - Self Reported (ft²)
- Impact: This is the most influential feature. Larger gross floor areas tend to increase the model's predictions, indicating a higher energy usage intensity (EUI).
- Interior_LivingArea
- Impact: Similar to gross floor area, larger interior living areas are associated with higher EUI, suggesting that more space generally leads to more energy consumption.
- UnfinishedBasementGross
- Impact: This feature also contributes positively to EUI. Larger unfinished basements might be indicative of additional space requiring heating or cooling.
- Imputed_OverallCondition
- Impact: Better overall condition tends to lower EUI, suggesting that well-maintained buildings are more energy-efficient.
- PD Parcel Living Area
- Impact: Like other area-related features, larger parcel living areas are linked to increased EUI.
- Year Built
- Impact: Newer buildings (higher values) generally show lower EUI, reflecting improvements in building standards and energy efficiency over time.
- Parking_Garage
- Impact: The presence of parking garages tends to increase EUI, possibly due to additional lighting and ventilation needs.
- Interior_NumUnits
- Impact: More units within a building correlate with higher EUI, likely due to increased occupancy and resource usage.
- Condition_OverallCondition
- Impact: Similar to imputed condition, better conditions generally reduce EUI, emphasizing the role of maintenance in energy efficiency.
- Primary Property Type - Self Selected_Multifamily Housing
- Impact: Multifamily housing types are associated with specific energy usage patterns, typically showing higher EUI compared to single-family homes.
9. Preliminary Insights and Conclusions¶
# comment written with help of claudia
Key Insights:
XGBoost significantly outperforms linear regression: The substantial improvement in both Mean Squared Error (MSE) and R-squared demonstrates the superiority of XGBoost for this dataset. XGBoost is able to capture the complex relationships in the data that the linear model misses, and does so without needing normally distributed residuals.
Year Built emerges as top predictor: Interestingly, Year Built, which was insignificant in the linear model, is now the most important feature according to XGBoost. This suggests that the relationship between year of construction and EUI might be non-linear, something a tree-based model like XGBoost can capture effectively.
Building type remains important: The "Parking" property type still holds considerable predictive power, confirming the strong influence of building function on energy consumption. Other property types, while present in the feature importance list, now have lesser weights relative to Year Built. This shift in feature importance highlights the difference between linear and tree-based models.
Other influential features: Features like Interior_Fireplaces, Exterior_RoofMaterial_TAR-GRAVEL, FinishedBasementGross, and UnfinishedBasementGross also emerge as influential predictors. These variables likely capture aspects of building characteristics and construction that affect energy efficiency.
Conclusions:
XGBoost is a more suitable model: For this dataset and prediction task, XGBoost clearly provides a better fit and more accurate predictions compared to linear regression. Its ability to handle non-linear relationships and non-normal data makes it a more appropriate choice.
Further exploration of Year Built: It's crucial to investigate the relationship between Year Built and EUI in more detail. Visualizing this relationship could reveal important trends, such as changes in building codes, technologies, or materials over time, that impact energy performance.
Potential for feature engineering: Given the importance of Year Built, consider creating new features based on this variable, such as building age or era of construction. This could enhance model performance further.
Model Performance Analysis:
- The XGBoost model demonstrates significantly better performance than the linear regression model. The lower MSE and higher R-squared indicate a better fit to the data and improved predictive accuracy.
- The feature importance rankings reveal that Year Built is the most influential predictor, followed by building type and other building characteristics.
- The improved performance suggests that the non-linear relationships captured by XGBoost are essential for understanding the factors that drive EUI.
Next steps:
- Hyperparameter tuning: Explore different hyperparameter settings for the XGBoost model to potentially improve performance further. Techniques like grid search or Bayesian optimization can be helpful.
- Feature engineering: Consider creating additional features based on existing ones, particularly related to Year Built and other important variables, to enhance the model's ability to capture relevant patterns.
- Cross-validation: Perform rigorous cross-validation to ensure the model generalizes well to unseen data and to obtain more robust performance estimates.
- Residual analysis: Analyze the residuals of the XGBoost model to identify potential biases or patterns that might suggest areas for improvement.
By focusing on these next steps, you can refine the XGBoost model further and gain a deeper understanding of the complex relationship between building characteristics and energy consumption. This will pave the way for developing targeted strategies to improve energy efficiency in buildings.